Group numbers into ranges in Python [duplicate]

【Title】: Group numbers into ranges in Python [duplicate] 【Posted】: 2017-07-13 22:12:22 【Question】:

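I want to automatically group numbers that fall into different regions. From the plot (sample data set), we can see three regions where the numbers lie: [0,100], [650,750], and [1220, 1300]. I just want to identify these regions. There can be any number of such regions, so both the number of regions and the range of each one have to be found automatically. The gap between two regions will be very large. Is there a way to do this in Python?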

Sample data = [69,  8,  30, 45, 89, 61, 80, 45, 9,  18, 19, 11, 1255,   1299,   1296,   1293,   1287,   1250,   1265,   1291,   1281,   1250,   1286,   1286,   1251,   1287,   1266,   1288,   1254,   1260,   1260,   1254,   1267,   1299,   1273,   1250,   1300,   1250,   1279,   1255,   1293,   1292,   1278,   1277,   1252,   1299,   1278,   1258,   1268,   1274,   1285,   1258,   1279,   1270,   1278,   1286,   1278,   1253,   1267,   1300,   1295,   1298,   1285,   1288,   1274,   1272,   1252,   1256,   1283,   1289,   1251,   1258,   1253,   1257,   1297,   1269,   1292,   1253,   1273,   1281,   1251,   1280,   1253,   1274,   1275,   1287,   1296,   1298,   1296,   1291,   1284,   1261,   1267,   1290,   1273,   1281,   1263,   1270,   1264,   1269,   1278,   1284,   67, 8,  40, 59, 97, 64, 45, 72, 45, 90, 94, 7,  33, 58, 97, 97, 1252,   1297,   1265,   1278,   1272,   1252,   1258,   1261,   1287,   1260,   1260,   1258,   1280,   1263,   1256,   1296,   1269,   1270,   1296,   1282,   696,    678,    665,    700,    700,    691,    689,    688,    650,    663,    662,    698,    655,    660,    662,    684,    690,    657,    653,    663,    670,    691,    687,    675,    694,    670,    676,    659,    661,    664,    664,    689,    683,    675,    687,    691,    676,    659,    689,    657,    659,    656,    654,    679,    669,    687,    666,    662,    691,    1260,   1276,   1252,   1295,   1257,   1277,   1281,   1257,   1295,   1269,   1265,   1290,   1266,   1269,   1286,   1254,   1260,   1265,   1290,   1294,   1286,   1279,   1254,   1256,   1276,   1285,   1282,   1251,   1282,   1261,   1253,   56, 74, 85, 94, 18, 83, 38, 80, 8,  4,  78, 43, 7,  79, 68, 78, 1275,   1250,   1268,   1297,   1284,   1255,   1294,   1262,   1250,   1252,   680,    693,    677,    676,    670,    653,    670,    661,    658,    695,    665,    671,    656,    686,    662,    691,    675,    658,    671,    650,    667,    653,    652,    686,    667,    682,    
694,    654,    689,    682,    667,    658,    651,    652,    692,    652,    655,    651,    650,    698,    655,    650,    679,    672,    697,    696,    696,    683,    1277,   1264,   1274,   1260,   1285,   1285,   1283,   1259,   1260,   1288,   1281,   1284,   1281,   1257,   1285,   1295,   1273,   1264,   1283,   1284,   1300,   1299,   1257,   1297,   1254,   1257,   1270,   1257,   1295,   34, 5,  73, 42, 27, 36, 91, 85, 19, 50, 34, 21, 73, 38, 18, 73]

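【Comments】:

What output do you want in this case? Can you provide a sample of the data (not a picture)?

If possible, I would like this as the output: [[0,100], [650,750], [1220, 1300]]. These are the ranges in which all the data lies. Basically there is a large list of data: a few values lie in the range 0 to 100, then there is a large gap, then a few lie between 650 and 750, and after another large gap a few more lie between 1220 and 1300.

【Answer 1】:

Following @schwobaseggl's suggestion, I referred to "Unsupervised clustering with unknown number of clusters" and changed the code slightly to suit my needs. Here is the new code: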

import numpy
import scipy.cluster.hierarchy as hcluster

temp_data = [31,68,74,46,47,83,29,11,9,52,1272,1288,1297,1285,1294,1251,1265,1257,1280,1265,1292,1297,1271,1273,1253,1273,1291,1251,1295,1298,1264,1281,1294,1280,1250,1279,1298,1290,1294,1299,1266,1260,1298,1292,1280,1259,1266,1276,1253,1252,1271,1280,1284,1266,1254,1259,1291,1268,1253,1298,1288,1271,1298,1300,1274,1294,1263,1298,1270,1254,1266,1269,1283,1285,1286,1276,1257,1266,1272,1298,1261,1251,1272,1260,1291,1269,1260,1294,1287,1256,1253,1284,1269,1287,1292,1269,1272,1275,1250,1289,56,35,19,80,47,22,92,8,10,24,87,76,60,63,64,0,1295,1268,1280,1281,1277,1300,1278,1273,1250,1296,1266,1269,1282,1281,1272,1260,1292,1272,1253,1255,1299,1269,1268,1294,1250,1299,1292,1254,1281,1289,1259,1290,1271,1280,1272,1300,1258,1290,1289,1300,1299,1261,1300,1276,1290,1299,1280,1267,1283,1282,1269,1260,1285,1252,1250,1263,1297,1300,1292,1266,1260,1263,1292,1296,1289,1297,1251,1261,1250,1294,1278,1284,1291,1281,1269,1261,1257,1267,1265,1288,1291,1257,1296,1251,1260,1272,1294,1285,1269,1283,1297,1287,1253,1292,1299,1295,1286,1288,1283,1290,20,73,81,6,49,88,96,61,49,94,57,16,61,16,17,19,1280,1257,1259,1277,1257,1262,1263,1280,1292,1250,1287,1272,1258,1253,1285,1285,1257,1291,1273,1260,1267,1250,1280,1281,1263,1269,1292,1250,1282,1263,1274,1288,1296,1266,1291,1271,1273,1281,1261,1289,1269,1287,1296,1283,1280,1298,1259,1270,1259,1289,1269,1284,1295,1297,1256,1300,1281,1296,1284,1288,1285,1296,1277,1251,1279,1295,1281,1264,1280,1263,69,8,30,45,89,61,80,45,9,18,19,11,1255,1299,1296,1293,1287,1250,1265,1291,1281,1250,1286,1286,1251,1287,1266,1288,1254,1260,1260,1254,1267,1299,1273,1250,1300,1250,1279,1255,1293,1292,1278,1277,1252,1299,1278,1258,1268,1274,1285,1258,1279,1270,1278,1286,1278,1253,1267,1300,1295,1298,1285,1288,1274,1272,1252,1256,1283,1289,1251,1258,1253,1257,1297,1269,1292,1253,1273,1281,1251,1280,1253,1274,1275,1287,1296,1298,1296,1291,1284,1261,1267,1290,1273,1281,1263,1270,1264,1269,1278,1284,67,8,40,59,97,64,45,72,45,90,94,7,33,58,97,97,1252,1297,1265,1278,1272,1252,1258,
1261,1287,1260,1260,1258,1280,1263,1256,1296,1269,1270,1296,1282,696,678,665,700,700,691,689,688,650,663,662,698,655,660,662,684,690,657,653,663,670,691,687,675,694,670,676,659,661,664,664,689,683,675,687,691,676,659,689,657,659,656,654,679,669,687,666,662,691,1260,1276,1252,1295,1257,1277,1281,1257,1295,1269,1265,1290,1266,1269,1286,1254,1260,1265,1290,1294,1286,1279,1254,1256,1276,1285,1282,1251,1282,1261,1253,56,74,85,94,18,83,38,80,8,4,78,43,7,79,68,78,1275,1250,1268,1297,1284,1255,1294,1262,1250,1252,680,693,677,676,670,653,670,661,658,695,665,671,656,686,662,691,675,658,671,650,667,653,652,686,667,682,694,654,689,682,667,658,651,652,692,652,655,651,650,698,655,650,679,672,697,696,696,683,1277,1264,1274,1260,1285,1285,1283,1259,1260,1288,1281,1284,1281,1257,1285,1295,1273,1264,1283,1284,1300,1299,1257,1297,1254,1257,1270,1257,1295,34,5,73,42,27,36,91,85,19,50,34,21,73,38,18,73]

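# fclusterdata expects a 2-D array (one row per observation), so each value becomes the point (td, td)
ndata = [[td, td] for td in temp_data]
data = numpy.array(ndata)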

# clustering
thresh = (11.0/100.0) * (max(temp_data) - min(temp_data))  #Threshold 11% of the total range of data

clusters = hcluster.fclusterdata(data, thresh, criterion="distance")

total_clusters = max(clusters)

clustered_index = []
for i in range(total_clusters):
    clustered_index.append([])

for i in range(len(clusters)):
    clustered_index[clusters[i] - 1].append(i)

clustered_range = []
for x in clustered_index:
    clustered_index_x = [temp_data[y] for y in x]
    clustered_range.append((min(clustered_index_x) , max(clustered_index_x)))

print(clustered_range)

I chose the threshold (thresh) to be 11% of the total range of the data.

So the output for this data set is:

[(0, 97), (1250, 1300), (650, 700)]
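
Because the data is one-dimensional and the regions are separated by very large gaps, a similar result can probably be obtained without SciPy by sorting the values and splitting wherever the gap between neighbours exceeds a threshold. The following is only a minimal sketch under that assumption; the function name `group_into_ranges` and the `gap_fraction` parameter are made up for illustration. Note that the threshold here is applied to the raw difference between neighbours, whereas the code above measures Euclidean distance between the duplicated (td, td) points, so the two thresholds are not numerically identical.

```python
def group_into_ranges(values, gap_fraction=0.11):
    """Return (min, max) pairs for runs of values separated by large gaps.

    gap_fraction is a hypothetical tuning knob: a new range starts whenever
    two neighbouring sorted values differ by more than this fraction of the
    total data range.
    """
    ordered = sorted(values)
    if not ordered:
        return []
    threshold = gap_fraction * (ordered[-1] - ordered[0])
    ranges = []
    start = prev = ordered[0]
    for v in ordered[1:]:
        if v - prev > threshold:
            # Gap is too large: close the current range and open a new one
            ranges.append((start, prev))
            start = v
        prev = v
    ranges.append((start, prev))
    return ranges


# A small subset of the values from the question, for illustration only
sample = [69, 8, 30, 1255, 1299, 1250, 696, 678, 650, 700, 97, 0, 1300]
print(group_into_ranges(sample))
# → [(0, 97), (650, 700), (1250, 1300)]
```

The number of ranges is not fixed in advance; it falls out of how many gaps exceed the threshold, so the result still depends on choosing a sensible `gap_fraction` for the data.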

【Comments】:

【Answer 2】:

You are probably looking for k-means clustering.

from typing import Tuple, Iterable, Sequence, List, Dict, DefaultDict
from random import sample
from math import fsum, sqrt
from collections import defaultdict
from functools import partial

Point = Tuple[int, ...]
Centroid = Point

def mean(data: Iterable[float]) -> float:
    'Accurate arithmetic mean'
    data = list(data)
    return fsum(data) / len(data)

def dist(p: Point, q: Point, sqrt=sqrt, fsum=fsum, zip=zip) -> float:
    'Euclidean distance'
    return sqrt(fsum((x - y) ** 2.0 for x, y in zip(p, q)))

def assign_data(centroids: Sequence[Centroid], data: Iterable[Point]) -> Dict[Centroid, Sequence[Point]]:
    'Assign data the closest centroid'
    d = defaultdict(list)             # type: DefaultDict[Point, List[Point]]
    for p in data:
        centroid = min(centroids, key=partial(dist, p))  # type: Point
        d[centroid].append(p)
    return dict(d)

def compute_centroids(groups: Iterable[Sequence[Point]]) -> List[Centroid]:
    'Compute the centroid of each group'
    return [tuple(map(mean, zip(*pts))) for pts in groups]

def k_means(data: Iterable[Point], k:int=2, iterations:int=10) -> List[Point]:
    'Return k-centroids for the data'
    data = list(data)
    centroids = sample(data, k)
    for i in range(iterations):
        labeled = assign_data(centroids, data)
        centroids = compute_centroids(labeled.values())
    return centroids

Here it is applied to your problem:

data = [69, 8, 30, 45, 89, 61, 80, 45, 9, 18, 19, 11, 1255, 1299,
        1296, 1293, 1287, 1250, 1265, 1291, 1281, 1250, 1286,
        1286, 1251, 1287, 1266, 1288, 1254, 1260, 1260, 1254,
        1267, 1299, 1273, 1250, 1300, 1250, 1279, 1255, 1293,
        1292, 1278, 1277, 1252, 1299, 1278, 1258, 1268, 1274,
        1285, 1258, 1279, 1270, 1278, 1286, 1278, 1253, 1267,
        1300, 1295, 1298, 1285, 1288, 1274, 1272, 1252, 1256,
        1283, 1289, 1251, 1258, 1253, 1257, 1297, 1269, 1292,
        1253, 1273, 1281, 1251, 1280, 1253, 1274, 1275, 1287,
        1296, 1298, 1296, 1291, 1284, 1261, 1267, 1290, 1273,
        1281, 1263, 1270, 1264, 1269, 1278, 1284, 67, 8, 40, 59,
        97, 64, 45, 72, 45, 90, 94, 7, 33, 58, 97, 97, 1252, 1297,
        1265, 1278, 1272, 1252, 1258, 1261, 1287, 1260, 1260,
        1258, 1280, 1263, 1256, 1296, 1269, 1270, 1296, 1282, 696,
        678, 665, 700, 700, 691, 689, 688, 650, 663, 662, 698,
        655, 660, 662, 684, 690, 657, 653, 663, 670, 691, 687,
        675, 694, 670, 676, 659, 661, 664, 664, 689, 683, 675,
        687, 691, 676, 659, 689, 657, 659, 656, 654, 679, 669,
        687, 666, 662, 691, 1260, 1276, 1252, 1295, 1257, 1277,
        1281, 1257, 1295, 1269, 1265, 1290, 1266, 1269, 1286,
        1254, 1260, 1265, 1290, 1294, 1286, 1279, 1254, 1256,
        1276, 1285, 1282, 1251, 1282, 1261, 1253, 56, 74, 85, 94,
        18, 83, 38, 80, 8, 4, 78, 43, 7, 79, 68, 78, 1275, 1250,
        1268, 1297, 1284, 1255, 1294, 1262, 1250, 1252, 680, 693,
        677, 676, 670, 653, 670, 661, 658, 695, 665, 671, 656,
        686, 662, 691, 675, 658, 671, 650, 667, 653, 652, 686,
        667, 682, 694, 654, 689, 682, 667, 658, 651, 652, 692,
        652, 655, 651, 650, 698, 655, 650, 679, 672, 697, 696,
        696, 683, 1277, 1264, 1274, 1260, 1285, 1285, 1283, 1259,
        1260, 1288, 1281, 1284, 1281, 1257, 1285, 1295, 1273,
        1264, 1283, 1284, 1300, 1299, 1257, 1297, 1254, 1257,
        1270, 1257, 1295, 34, 5, 73, 42, 27, 36, 91, 85, 19, 50,
        34, 21, 73, 38, 18, 73]

points = [(x,) for x in data]
centroids = k_means(points, k=3, iterations=100)
clusters = assign_data(centroids, points).values()
for cluster in clusters:
    print(f'{min(cluster)[0]} to {max(cluster)[0]}')

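This outputs (the order of the lines may vary between runs, because the initial centroids are sampled at random):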

4 to 97
650 to 700
1250 to 1300

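【Comments】:

@schwobaseggl "Unsupervised clustering with unknown number of clusters" deals with 3-dimensional data. I only want the ranges in which my numbers lie. Also, in k-means clustering the number K is predefined, but in my data set I don't know how many clusters I will have.

Then your problem is ill-defined. Without constraints, there are trivial solutions (all points in one cluster, or every point being its own cluster of size 1). Also, the k-means code works for any number of dimensions. The example was given in three dimensions, but it works just as well in one dimension.

【Answer 3】:

One possibility is to use sklearn's Gaussian mixture model to find the clusters. Since there is only one dimension, this may be overkill, but it should work anyway.

Here is an example: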

import numpy as np
from sklearn import mixture

# Generate some example data
centers = [50, 700, 1250]  # Approximate
y = []
for i in range(20):
    y.append(np.random.randn(100) * 50 + centers[np.random.randint(len(centers))])
y = np.c_[np.concatenate(y)]

N = 3  # Number of clusters
gmix = mixture.GaussianMixture(n_components=N, covariance_type='full')
gmix.fit(y)  # Now it thinks it is trained

for i in range(N):
    center = gmix.means_[i][0]
    # With covariance_type='full', covariances_ has shape (N, 1, 1) for 1-D data
    std = np.sqrt(gmix.covariances_[i][0][0])
    print("Range %d: %.0f to %.0f" % (i + 1, center - std, center + std))

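It should output something like the following (originally run under Python 2.7 in IPython; the exact values vary because the data is random):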

Range 1: 1199 to 1298
Range 2: 649 to 752
Range 3: 4 to 102

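This of course depends on the clusters it finds, which may not always be a good solution.

【Comments】:

But you have predefined N = 3 # clusters. We won't have any such information at the start. We need to find out automatically how many such regions the data set contains.

Well, you modified your question after I posted my answer, so it was hard for me to know that you also wanted to find the number of clusters. But you could, for example, use the Bayesian information criterion (BIC) score to determine it.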
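
The BIC suggestion in the last comment could be sketched roughly as follows: fit a Gaussian mixture for several candidate numbers of components and keep the one with the lowest BIC. This is only an illustration, not the answerer's code; the helper name `pick_mixture_by_bic`, the `max_components` limit, and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def pick_mixture_by_bic(values, max_components=6, seed=0):
    """Fit mixtures with 1..max_components components; return the lowest-BIC model."""
    X = np.asarray(values, dtype=float).reshape(-1, 1)  # sklearn expects 2-D input
    best_model, best_bic = None, np.inf
    for n in range(1, max_components + 1):
        model = GaussianMixture(n_components=n, random_state=seed).fit(X)
        bic = model.bic(X)  # lower BIC is better
        if bic < best_bic:
            best_model, best_bic = model, bic
    return best_model


# Synthetic, well-separated example data (an assumption for illustration)
rng = np.random.default_rng(0)
values = np.concatenate([
    rng.normal(50, 10, 200),
    rng.normal(700, 10, 200),
    rng.normal(1270, 10, 200),
])
model = pick_mixture_by_bic(values)
print("components chosen:", model.n_components)
print("cluster centres:", sorted(model.means_.ravel()))
```

BIC can pick more components than there are visible regions when the values inside a region are not close to Gaussian, so the chosen number is worth checking against a plot of the data.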
