Grouping values using pandas cut
【Posted】: 2016-09-09 23:56:15
【Question】: I am trying to group values from several CSV files into the bins defined in an XML file (groups.xml). I have the following code, which works to some extent but does not produce the result I expected:
import os, sys
import glob
import pandas as pd
import xml.etree.cElementTree as ET

def xml_parse():
    try:
        os.chdir("path/to/files")
        filename = [file1 for file1 in glob.glob("*.csv")]
        filename = [i.split('.', 1)[0] for i in filename]
        #filename = '\n'.join(filename)
        os.chdir('..')
        output = []
        doc = ET.parse("groups.xml").getroot()
        for root_ele in doc.findall('Groups'):
            tag_ele = root_ele.find('GroupID').text
            for name in filename:
                if name == tag_ele.lower():
                    for root_ele1 in root_ele.findall('groupname'):
                        displayname = root_ele1.find('Name').text
                        minval = root_ele1.find('min').text
                        mininc = root_ele1.find('minInc').text
                        maxvalue = root_ele1.find('max')
                        maxinclusive = root_ele1.find('maxInc')
                        lists = []
                        frame = pd.DataFrame()
                        fname = "path/to/files" + name + ".csv"
                        df = pd.read_csv(fname, index_col=None, header=None)
                        lists.append(df)
                        frame = pd.concat(lists)
                        if maxvalue is not None:
                            maxval = maxvalue.text
                            if maxinclusive is not None:
                                maxinc = maxinclusive.text
                                df['bin'] = pd.cut(frame[1], [float(minval),float(maxval)], right= maxinc, include_lowest= mininc)
                                out = str(pd.concat([df['bin'], frame[1]], axis=1))
                                out = out.split("\n")[2:]
                                for a in out:
                                    print a
                        else:
                            df['bin'] = pd.cut(frame[1], [float(minval)], include_lowest= mininc)
                            out = str(pd.concat([df['bin'], frame[1]], axis=1))
                            out = out.split("\n")[2:]
                            for a in out:
                                print a
                    break
    except AttributeError:
        pass
Current output:
1 NaN 10.18
2 NaN 25.16
3 NaN 44.48
4 NaN 85.24
5 NaN 36.71
6 NaN 77.09
7 NaN 81.88
8 NaN 22.92
9 NaN 44.31
10 NaN 15.79
1 [10, 18] 10.18
2 NaN 25.16
3 NaN 44.48
4 NaN 85.24
5 NaN 36.71
6 NaN 77.09
7 NaN 81.88
8 NaN 22.92
9 NaN 44.31
10 [10, 18] 15.79
1 NaN 10.18
2 [18, 35] 25.16
3 NaN 44.48
4 NaN 85.24
5 NaN 36.71
6 NaN 77.09
7 NaN 81.88
8 [18, 35] 22.92
9 NaN 44.31
10 NaN 15.79
1 NaN 10.18
2 NaN 25.16
3 [35, 50] 44.48
4 NaN 85.24
5 [35, 50] 36.71
6 NaN 77.09
7 NaN 81.88
8 NaN 22.92
9 [35, 50] 44.31
10 NaN 15.79
1 NaN 10.18
2 NaN 25.16
3 NaN 44.48
4 NaN 85.24
5 NaN 36.71
6 NaN 77.09
7 NaN 81.88
8 NaN 22.92
9 NaN 44.31
10 NaN 15.79
1 NaN 10.18
2 NaN 25.16
3 NaN 44.48
4 NaN 85.24
5 NaN 36.71
6 NaN 77.09
7 NaN 81.88
8 NaN 22.92
9 NaN 44.31
10 NaN 15.79
Followed by this error:
Traceback (most recent call last):
File "groups.py", line 69, in <module>
xml_parse()
File "groups.py", line 44, in xml_parse
df['bin'] = pd.cut(frame[1], [float(minval)], include_lowest= mininc)
File "C:\Python27\lib\site-packages\pandas\tools\tile.py", line 113, in cut
include_lowest=include_lowest)
File "C:\Python27\lib\site-packages\pandas\tools\tile.py", line 203, in _bins_to_cuts
include_lowest=include_lowest)
File "C:\Python27\lib\site-packages\pandas\tools\tile.py", line 252, in _format_levels
levels[0] = '[' + levels[0][1:]
IndexError: list index out of range
Expected output:
1 [10, 18] 10.18
2 [18, 35] 25.16
3 [35, 50] 44.48
4 [>= 75] 85.24 #however >=75 can be represented
5 [35, 50] 36.71
6 [>= 75] 77.09
7 [>= 75] 81.88
8 [18, 35] 22.92
9 [35, 50] 44.31
10 [10, 18] 15.79
【Answer 1】: Starting with:
df:
val1 val2
0 NaN 10
1 10.18 1
2 25.16 1
3 44.48 1
4 85.24 1
5 36.71 1
6 77.09 1
7 81.88 1
8 22.92 1
9 44.31 1
10 15.79 1
and
xml = """
<metaGroups>
  <Groups>
    <GroupID>age</GroupID>
    <description>age</description>
    <groupname>
      <Name>0 - <10</Name>
      <min>0</min>
      <minInc>TRUE</minInc>
      <max>10</max>
      <maxInc>FALSE</maxInc>
    </groupname>
    <groupname>
      <Name>10 - <18</Name>
      <min>10</min>
      <minInc>TRUE</minInc>
      <max>18</max>
      <maxInc>FALSE</maxInc>
    </groupname>
    <groupname>
      <Name>18 - <35</Name>
      <min>18</min>
      <minInc>TRUE</minInc>
      <max>35</max>
      <maxInc>FALSE</maxInc>
    </groupname>
    <groupname>
      <Name>35 - <50</Name>
      <min>35</min>
      <minInc>TRUE</minInc>
      <max>50</max>
      <maxInc>FALSE</maxInc>
    </groupname>
    <groupname>
      <Name>50 - <65</Name>
      <min>50</min>
      <minInc>TRUE</minInc>
      <max>65</max>
      <maxInc>FALSE</maxInc>
    </groupname>
    <groupname>
      <Name>65 - <75</Name>
      <min>65</min>
      <minInc>TRUE</minInc>
      <max>75</max>
      <maxInc>FALSE</maxInc>
    </groupname>
    <groupname>
      <Name>&ge;75</Name>
      <min>75</min>
      <minInc>TRUE</minInc>
    </groupname>
  </Groups>
</metaGroups>
"""
You can use BeautifulSoup to extract the bin parameters, construct the labels, and apply pd.cut():
from itertools import chain

import pandas as pd
from bs4 import BeautifulSoup as Soup

soup = Soup(xml, 'html.parser')
bins = []
for group in soup.find_all('groupname'):
    low = group.find('min').text
    max_tag = group.find('max')  # None for the open-ended top group
    if max_tag is not None:
        bins.append([low, max_tag.text])
    else:
        bins.append([low])  # top group has only a lower bound
We now have:
bins
[['0', '10'], ['10', '18'], ['18', '35'], ['35', '50'], ['50', '65'], ['65', '75'], ['75']]
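If the groups file is well-formed XML, the same list of pairs can be built with the standard library's ElementTree, which is what the question's code was using. The sketch below is only an illustration under that assumption: the `xml_text` excerpt is hypothetical, with `<` inside `<Name>` escaped as `&lt;` and the last name written in plain text, because a strict XML parser rejects the bare `<` and the `&ge;` entity that appear in the string above. The names `xml_text` and `et_bins` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed excerpt of groups.xml ("<" escaped as "&lt;").
xml_text = """
<metaGroups>
  <Groups>
    <GroupID>age</GroupID>
    <groupname>
      <Name>0 - &lt;10</Name>
      <min>0</min>
      <minInc>TRUE</minInc>
      <max>10</max>
      <maxInc>FALSE</maxInc>
    </groupname>
    <groupname>
      <Name>75 and over</Name>
      <min>75</min>
      <minInc>TRUE</minInc>
    </groupname>
  </Groups>
</metaGroups>
"""

root = ET.fromstring(xml_text)
et_bins = []
for group in root.iter('groupname'):
    low = group.findtext('min')
    high = group.findtext('max')  # findtext returns None when <max> is absent
    et_bins.append([low] if high is None else [low, high])

print(et_bins)
```

Using `findtext` avoids the `AttributeError` that `find('max').text` raises when a group has no `<max>` element.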
Next, flatten the list of lists, drop the duplicates, and add an upper bound:
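labels = bins  # keep the [min, max] string pairs; they become the labels below
bins = sorted({int(edge) for edge in chain.from_iterable(bins)})  # unique, ascending edges
bins.append(int(df.val1.max() + 1))  # upper edge for the open-ended top group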
which gives:
[0, 10, 18, 35, 50, 65, 75, 86]
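The extra upper edge matters because `pd.cut` needs two edges to form an interval; the question's call with the single edge `[float(minval)]` gives the last group no interval to fall into. As an alternative to `max + 1`, the following minimal sketch closes the open-ended group with `np.inf`, so the edges do not depend on the data. The sample values and the names `values`, `edges`, and `codes` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative values only; not the asker's data.
values = pd.Series([10.18, 85.24, 1000.0])

# Finite edges taken from the XML, plus +inf to close the ">= 75" group.
edges = [0, 10, 18, 35, 50, 65, 75, np.inf]

# right=False makes every bin [low, high), so 75 itself lands in the last bin.
binned = pd.cut(values, bins=edges, right=False)
codes = binned.cat.codes.tolist()  # position of each value's bin
print(codes)
```

With an infinite upper edge, any value of 75 or more falls into the last bin regardless of how large the data gets.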
Build the labels:
labels = ['[{0} - {1}]'.format(label[0], label[1]) if len(label) > 1 else '[>= {0}]'.format(label[0]) for label in labels]
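and use pd.cut():
# right=False makes each bin [min, max), matching minInc=TRUE / maxInc=FALSE in the XML
df['binned'] = pd.cut(df.val1, bins=bins, labels=labels, right=False)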
which gives:
val1 val2 binned
1 10.18 1 [10 - 18]
2 25.16 1 [18 - 35]
3 44.48 1 [35 - 50]
4 85.24 1 [>= 75]
5 36.71 1 [35 - 50]
6 77.09 1 [>= 75]
7 81.88 1 [>= 75]
8 22.92 1 [18 - 35]
9 44.31 1 [35 - 50]
10 15.79 1 [10 - 18]
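To check how values that sit exactly on an edge are assigned, here is a self-contained sketch of the whole approach. It assumes every group is `[min, max)`, as the `minInc`/`maxInc` flags in the XML indicate, hard-codes the `[min, max]` pairs instead of parsing them, and uses made-up sample values; the names `groups`, `sample`, `edges`, and `result` are illustrative only.

```python
import pandas as pd

# [min, max] pairs as extracted above; the last group has no max.
groups = [['0', '10'], ['10', '18'], ['18', '35'], ['35', '50'],
          ['50', '65'], ['65', '75'], ['75']]

# Made-up sample values, including the exact boundaries 18 and 75.
sample = pd.Series([10.18, 18.0, 74.99, 75.0, 85.24])

# Unique sorted edges plus an upper edge just above the largest value.
edges = sorted({int(v) for g in groups for v in g})
edges.append(int(sample.max() + 1))

labels = ['[{0} - {1}]'.format(*g) if len(g) > 1 else '[>= {0}]'.format(g[0])
          for g in groups]

# right=False -> bins are [low, high): 18 belongs to "[18 - 35]", not "[10 - 18]".
result = pd.cut(sample, bins=edges, labels=labels, right=False)
print(result.tolist())
```

With the default `right=True` the bins would be `(low, high]` instead, and 18.0 would be labelled `[10 - 18]`, which contradicts the inclusive lower bounds in the XML.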
【Comments】:
Thanks for your answer. 'bins' returns nothing for me.
I had not included the xml string; please see the updated answer.
I used my own XML file. It seems I was not reading it correctly. It works now. Thanks a lot!