Binning data and inclusive result

【Title】: binning data and inclusive result 【Posted】: 2012-05-15 09:43:05 【Question】:

Suppose I have already binned some data in a structure like this:

data = {(1, 1): [...],  # list of float
        (1, 2): [...],
        (1, 3): [...],
        (2, 1): [...],
        # ...
        }

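Here I have only two axes for the binning, but suppose I have N of them. Now suppose, for example, that I have N=3 axes and I want the data whose second bin is 1, so I want a function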

(None, 1, None) -> [(1, 1, 1), (1, 1, 2), (1, 1, 3), ...
                    (2, 1, 1), (2, 1, 2), (2, 1, 3), ...]

so that I can use itertools.chain on the result.

You know the range of each axis from:

axes_ranges = [(1, 10), (1, 8), (1, 3)]

Other examples:

(None, 1, 2) -> [(1, 1, 2), (2, 1, 2), (3, 1, 2), ...]
(None, None, None) -> all the combinations
(1,2,3) -> [(1,2,3)]

【Question comments】:

【Answer 1】:

It looks a lot like you are reinventing the wheel. What you probably want to use is numpy.ndarray:

    >>> import numpy as np
    >>> x = np.arange(0, 27)
    >>> x
    array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
           17, 18, 19, 20, 21, 22, 23, 24, 25, 26])
    >>> x = x.reshape(3, 3, 3)  # reshape returns a new array, so rebind x
    >>> x
    array([[[ 0,  1,  2],
            [ 3,  4,  5],
            [ 6,  7,  8]],

           [[ 9, 10, 11],
            [12, 13, 14],
            [15, 16, 17]],

           [[18, 19, 20],
            [21, 22, 23],
            [24, 25, 26]]])
    >>> x[0]
    array([[0, 1, 2],
           [3, 4, 5],
           [6, 7, 8]])
    >>> x[:, 1, :]
    array([[ 3,  4,  5],
           [12, 13, 14],
           [21, 22, 23]])
    >>> x[:, 1, 1]
    array([ 4, 13, 22])

This works for N dimensions. In the example the index is three-dimensional; you can think of it as a cube with x[a,b,c] = x[layer,row,column]. Using ':' as an index just means 'all'.
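A bare ':' inside square brackets is equivalent to slice(None), so a pattern such as (None, 1, 1) can be turned into an index tuple programmatically. The following is a minimal sketch; the helper name to_index is hypothetical and not part of NumPy, and the array uses 0-based indices as in the example above.

```python
import numpy as np

def to_index(pattern):
    """Turn a pattern like (None, 1, 1) into an index tuple.

    None means "every bin on this axis", which is what ':' does,
    so it is replaced by slice(None). (Hypothetical helper.)
    """
    return tuple(slice(None) if p is None else p for p in pattern)

x = np.arange(27).reshape(3, 3, 3)

selected = x[to_index((None, 1, 1))]  # same as x[:, 1, 1]
print(selected)  # [ 4 13 22]
```

Indexing with a tuple built this way gives the same result as writing the slices by hand, so the pattern can come from user input or another function.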

【Comments】:

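That's nice. Now there are two questions: 1. How do I translate (None, 1, 1) into x[:, 1, 1]? What kind of symbol is ':'? 2. My data are not int (or float): for each bin I have a collection (a list) of floats.

Are the lists of floats all the same length?

【Answer 2】: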

Well, how about:

import itertools

def combinations_with_fixpoint(iterables, *args):
    # a fixed bin becomes a one-element list; None keeps the whole axis range
    return itertools.product(*([x] if x is not None else y for x, y in zip(args, iterables)))


axes_ranges = [(1, 7), (1, 8), (77, 79)]

combs = combinations_with_fixpoint(
    itertools.starmap(range, axes_ranges),
    None, 5, None
)

for p in combs:
    print(p)

# (1, 5, 77)
# (1, 5, 78)
# (2, 5, 77)
# (2, 5, 78)
# (3, 5, 77)
# (3, 5, 78)
# (4, 5, 77)
# (4, 5, 78)
# (5, 5, 77)
# (5, 5, 78)
# (6, 5, 77)
# (6, 5, 78)    

Or maybe just pass a list, to allow several "fixed points" on one axis:

def combinations_with_fixpoint(iterables, *args):
    return itertools.product(*(x or y for x, y in zip(args, iterables)))

combs = combinations_with_fixpoint(
    itertools.starmap(range, axes_ranges),
    None, [5, 6], None
)
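To connect this back to the question, the generated keys can be used to look up the binned lists and flatten them with itertools.chain. This is only a sketch: the function name expand and the tiny data dictionary are made up for illustration, and each (lo, hi) pair is assumed to be half-open, as range treats it.

```python
import itertools

def expand(pattern, axes_ranges):
    """Yield every bin key matching pattern; None means 'any bin on this axis'."""
    axes = (
        range(lo, hi) if p is None else [p]
        for p, (lo, hi) in zip(pattern, axes_ranges)
    )
    return itertools.product(*axes)

# Hypothetical example data: two axes with bins 1 and 2 on each.
axes_ranges = [(1, 3), (1, 3)]
data = {
    (1, 1): [0.1, 0.2],
    (1, 2): [0.3],
    (2, 1): [0.4, 0.5],
    (2, 2): [],
}

# All keys whose second bin is 1, then the floats stored under them.
keys = list(expand((None, 1), axes_ranges))
values = list(itertools.chain.from_iterable(data[k] for k in keys))
print(keys)    # [(1, 1), (2, 1)]
print(values)  # [0.1, 0.2, 0.4, 0.5]
```

Because the lists are concatenated rather than stacked, they do not need to have the same length.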

【Comments】:

【Answer 3】:
binning = [[0, 0.1, 0.2], [0, 10, 20], [-1, -2, -3]]
range_binning = [(1, len(x) + 1) for x in binning]

def expand_bin(thebin):
    def expand_bin_index(thebin, freeindex, rangebin):
        """
        thebin = [1, None, 3]
        freeindex = 1
        rangebin = [4,5]
        -> [[1, 4, 3], [1, 5, 3]]
        """
        result = []
        for r in rangebin:
            newbin = thebin[:]
            newbin[freeindex] = r
            result.append(newbin)
        return result

    tmp = [thebin]
    indexes_free = [i for i,aa in enumerate(thebin) if aa is None]
    for index_free in indexes_free:
        range_index = range(*(range_binning[index_free]))
        new_tmp = []
        for t in tmp:
            for expanded in expand_bin_index(t, index_free, range_index):
                new_tmp.append(expanded)
        tmp = new_tmp
    return tmp

inputs = ([None, 1, 2], [None, None, 3], [None, 1, None], [3, 2, 1], [None, None, None])
for i in inputs:
    print("%s -> %s" % (i, expand_bin(i)))

【Comments】:
