使用 pandas cut 对值进行分组

Posted

技术标签:

【中文标题】使用 pandas cut 对值进行分组【英文标题】:Grouping values using pandas cut 【发布时间】:2016-09-09 23:56:15 【问题描述】:

我正在尝试将几个 csv 文件中的值分组到 XML 文件 (groups.xml) 中的 bin 中。我有以下代码在一定程度上有效,但没有达到我的预期:

import os, sys
import glob
import pandas as pd
import xml.etree.cElementTree as ET

def xml_parse():
    try:
        os.chdir("path/to/files")
        filename = [file1 for file1 in glob.glob("*.csv")]
        filename = [i.split('.', 1)[0] for i in filename]
        #filename = '\n'.join(filename)
        os.chdir('..')
        output = []
        doc = ET.parse("groups.xml").getroot()
        for root_ele in doc.findall('Groups'):
            tag_ele = root_ele.find('GroupID').text
            for name in filename:
                if name == tag_ele.lower():
                    for root_ele1 in root_ele.findall('groupname'):
                        displayname = root_ele1.find('Name').text
                        minval = root_ele1.find('min').text
                        mininc = root_ele1.find('minInc').text
                        maxvalue = root_ele1.find('max')
                        maxinclusive = root_ele1.find('maxInc')
                        lists = []
                        frame = pd.DataFrame()
                        fname = "path/to/files" + name + ".csv"
                        df = pd.read_csv(fname, index_col=None, header=None)
                        lists.append(df)
                        frame = pd.concat(lists)
                        if maxvalue is not None:
                            maxval = maxvalue.text
                            if maxinclusive is not None:
                                maxinc = maxinclusive.text
                                df['bin'] = pd.cut(frame[1], [float(minval),float(maxval)], right= maxinc, include_lowest= mininc)
                                out = str(pd.concat([df['bin'], frame[1]], axis=1))
                                out = out.split("\n")[2:]
                                for a in out:
                                    print a
                        else:
                            df['bin'] = pd.cut(frame[1], [float(minval)], include_lowest= mininc)
                            out = str(pd.concat([df['bin'], frame[1]], axis=1))  
                            out = out.split("\n")[2:]
                            for a in out:
                                print a
            break
    except AttributeError:
        pass

当前输出:

1   NaN  10.18
2   NaN  25.16
3   NaN  44.48
4   NaN  85.24
5   NaN  36.71
6   NaN  77.09
7   NaN  81.88
8   NaN  22.92
9   NaN  44.31
10  NaN  15.79
1   [10, 18]  10.18
2        NaN  25.16
3        NaN  44.48
4        NaN  85.24
5        NaN  36.71
6        NaN  77.09
7        NaN  81.88
8        NaN  22.92
9        NaN  44.31
10  [10, 18]  15.79
1        NaN  10.18
2   [18, 35]  25.16
3        NaN  44.48
4        NaN  85.24
5        NaN  36.71
6        NaN  77.09
7        NaN  81.88
8   [18, 35]  22.92
9        NaN  44.31
10       NaN  15.79
1        NaN  10.18
2        NaN  25.16
3   [35, 50]  44.48
4        NaN  85.24
5   [35, 50]  36.71
6        NaN  77.09
7        NaN  81.88
8        NaN  22.92
9   [35, 50]  44.31
10       NaN  15.79
1   NaN  10.18
2   NaN  25.16
3   NaN  44.48
4   NaN  85.24
5   NaN  36.71
6   NaN  77.09
7   NaN  81.88
8   NaN  22.92
9   NaN  44.31
10  NaN  15.79
1   NaN  10.18
2   NaN  25.16
3   NaN  44.48
4   NaN  85.24
5   NaN  36.71
6   NaN  77.09
7   NaN  81.88
8   NaN  22.92
9   NaN  44.31
10  NaN  15.79

出现错误:

Traceback (most recent call last):
  File "groups.py", line 69, in <module>
    xml_parse()
  File "groups.py", line 44, in xml_parse
    df['bin'] = pd.cut(frame[1], [float(minval)], include_lowest= mininc)
  File "C:\Python27\lib\site-packages\pandas\tools\tile.py", line 113, in cut
    include_lowest=include_lowest)
  File "C:\Python27\lib\site-packages\pandas\tools\tile.py", line 203, in _bins_to_cuts
    include_lowest=include_lowest)
  File "C:\Python27\lib\site-packages\pandas\tools\tile.py", line 252, in _format_levels
    levels[0] = '[' + levels[0][1:]
IndexError: list index out of range

预期输出:

1   [10, 18]  10.18
2   [18, 35]  25.16
3   [35, 50]  44.48
4   [>= 75] 85.24 #however >=75 can be represented
5   [35, 50]  36.71
6   [>= 75] 77.09
7   [>= 75] 81.88
8   [18, 35]  22.92
9   [35, 50]  44.31
10  [10, 18]  15.79

【问题讨论】:

【参考方案1】:

开始于:

df:

     val1  val2
0     NaN    10
1   10.18     1
2   25.16     1
3   44.48     1
4   85.24     1
5   36.71     1
6   77.09     1
7   81.88     1
8   22.92     1
9   44.31     1
10  15.79     1

xml = """
<metaGroups>
    <Groups>
        <GroupID>age</GroupID>
        <description>age</description>
        <groupname>
            <Name>0 - &lt;10</Name>
            <min>0</min>
            <minInc>TRUE</minInc>
            <max>10</max>
            <maxInc>FALSE</maxInc>
        </groupname>
        <groupname>
            <Name>10 - &lt;18</Name>
            <min>10</min>
            <minInc>TRUE</minInc>
            <max>18</max>
            <maxInc>FALSE</maxInc>
        </groupname>
        <groupname>
            <Name>18 - &lt;35</Name>
            <min>18</min>
            <minInc>TRUE</minInc>
            <max>35</max>
            <maxInc>FALSE</maxInc>
        </groupname>
        <groupname>
            <Name>35 - &lt;50</Name>
            <min>35</min>
            <minInc>TRUE</minInc>
            <max>50</max>
            <maxInc>FALSE</maxInc>
        </groupname>
        <groupname>
            <Name>50 - &lt;65</Name>
            <min>50</min>
            <minInc>TRUE</minInc>
            <max>65</max>
            <maxInc>FALSE</maxInc>
        </groupname>
        <groupname>
            <Name>65 - &lt;75</Name>
            <min>65</min>
            <minInc>TRUE</minInc>
            <max>75</max>
            <maxInc>FALSE</maxInc>
        </groupname>
        <groupname>
            <Name>&amp;ge;75</Name>
            <min>75</min>
            <minInc>TRUE</minInc>
        </groupname>
    </Groups>
</metaGroups>
"""

您可以使用BeautifulSoup提取bin参数,构造标签并应用pd.cut()

from bs4 import BeautifulSoup as Soup
from itertools import chain

soup = Soup(xml, 'html.parser')

bins = []
for message in soup.findAll('groupname'):
    min = message.find('min').text
    try:
        max = message.find('max').text
        bins.append([min, max])
    except:
        bins.append([min]) # For max bin

我们现在有

bins

[['0', '10'], ['10', '18'], ['18', '35'], ['35', '50'], ['50', '65'], ['65', '75'], ['75']]

接下来,我们将展平listlist,去掉重复项并添加一个上限:

labels = bins
bins = list(np.unique(np.fromiter(chain.from_iterable(bins), dtype='int')))
last = bins[-1]
bins.append(int(df.val1.max() + 1))

产生:

[0, 10, 18, 35, 50, 65, 75, 86]

构建标签:

labels = ['[0 - 1]'.format(label[0], label[1]) if len(label) > 1 else '[ >  ]'.format(label[0]) for label in labels]

并使用pd.cut():

df['binned'] = pd.cut(df.val1, bins=bins, labels=labels)

产生:

     val1  val2     binned
1   10.18     1  [10 - 18]
2   25.16     1  [18 - 35]
3   44.48     1  [35 - 50]
4   85.24     1    [>= 75]
5   36.71     1  [35 - 50]
6   77.09     1    [>= 75]
7   81.88     1    [>= 75]
8   22.92     1  [18 - 35]
9   44.31     1  [35 - 50]
10  15.79     1  [10 - 18]

【讨论】:

感谢您的回答。 'bins' 对我没有任何回报。 我没有包含xml 字符串,请参阅更新的答案。 我使用了我的 xml 文件。看来我没有正确阅读。它现在工作。非常感谢!

以上是关于使用 pandas cut 对值进行分组的主要内容,如果未能解决你的问题,请参考以下文章

我如何在熊猫中分组然后对值求和? [复制]

Python之pandas数据整理及分组统计

使用 XSLT 1.0 对值进行分组

使用 Scala 根据 RDD 中的多个键列对值进行分组的最快方法是啥? [复制]

按键对哈希进行分组并对值求和

pandas使用groupby函数基于指定分组变量对dataframe数据进行分组使用size函数计算分组数据中每个分组样本的个数