read_csv with missing/incomplete header or irregular number of columns
【Title】: read_csv with missing/incomplete header or irregular number of columns
【Posted】: 2016-03-25 07:33:27
【Question】: I have a file.csv with about 15k rows that looks like this:
SAMPLE_TIME, POS, OFF, HISTOGRAM
2015-07-15 16:41:56, 0-0-0-0-3, 1, 2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,
2015-07-15 16:42:55, 0-0-0-0-3, 1, 0,0,5,9,0,0,0,0,0,2,0,0,0,50,0,
2015-07-15 16:43:55, 0-0-0-0-3, 1, 0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0,
2015-07-15 16:44:56, 0-0-0-0-3, 1, 2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0
I would like to import it into a pandas.DataFrame and give any arbitrary name to the columns that have no header, like this:
SAMPLE_TIME, POS, OFF, HISTOGRAM 1 2 3 4 5 6
2015-07-15 16:41:56, 0-0-0-0-3, 1, 2, 0, 5, 59, 4, 0, 0,
2015-07-15 16:42:55, 0-0-0-0-3, 1, 0, 0, 5, 0, 6, 0, nan
2015-07-15 16:43:55, 0-0-0-0-3, 1, 0, 0, 5, 0, 7, nan nan
2015-07-15 16:44:56, 0-0-0-0-3, 1, 2, 0, 5, 0, 0, 2, nan
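I have not been able to import this. I tried different solutions, such as supplying a specific header, but still no joy. The only way I could make it work was to add the header to the .csv file by hand, which rather defeats the purpose of automation!
Then I tried this solution, doing this: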
import csv
with open('file.csv') as f:
    lines = list(csv.reader(f))
header, values = lines[0], lines[1:]
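This reads the file correctly and gives me values, a list of ~15k elements. Each element is a list of strings, and each string is a correctly parsed data field from the file. But when I try to do this: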
data = {h: v for h, v in zip(header, zip(*values))}
df = pd.DataFrame.from_dict(data)
or this:
data2 = {h: v for h, v in zip(str(xrange(16)), zip(*values))}
df2 = pd.DataFrame.from_dict(data)
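then the columns without a header disappear and the order of the columns gets completely mixed up. Any ideas for a possible solution?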
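A minimal sketch of why the two dictionary attempts above lose columns, written for Python 3 with a shortened, made-up copy of the sample rows. `zip` stops at its shortest input, so only as many columns survive as there are header names, and calling `str()` on a range produces text rather than a sequence of numbers. On Python 2, plain dicts are also unordered, which would explain the shuffled column order.

```python
header = ["SAMPLE_TIME", "POS", "OFF", "HISTOGRAM"]
# Shortened, hypothetical rows with different lengths, as in the question.
values = [
    ["2015-07-15 16:41:56", "0-0-0-0-3", "1", "2", "0", "5"],
    ["2015-07-15 16:42:55", "0-0-0-0-3", "1", "0", "0"],
]

# zip(*values) transposes rows into columns but stops at the shortest row.
columns = list(zip(*values))
print(len(columns))

# zip(header, columns) stops at the shorter input, so every column
# beyond the four named ones is silently dropped.
data = {h: v for h, v in zip(header, columns)}
print(list(data))

# str(range(16)) is the text "range(0, 16)", so iterating over it yields
# single characters; repeated characters overwrite earlier dict keys.
keys = list(str(range(16)))
print(keys[:5])
```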
【Answer 1】: You can create the columns based on the length of the first actual data row:
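import pandas as pd

with open("out.txt") as f:
    # Read the header line and count the fields in the first data row.
    h, ln = next(f), len(next(f).split(","))
    header = h.strip().split(",")
    # Rewind, then skip the header line so pandas only sees data rows.
    f.seek(0)
    next(f)
    # Pad the header with integer names for the unnamed columns.
    header += range(ln)
    print(pd.read_csv(f, names=header))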
This will give you:
SAMPLE_TIME POS OFF HISTOGRAM 0 1 2 3 \
0 2015-07-15 16:41:56 0-0-0-0-3 1 2 0 5 59 0
1 2015-07-15 16:42:55 0-0-0-0-3 1 0 0 5 9 0
2 2015-07-15 16:43:55 0-0-0-0-3 1 0 0 5 5 0
3 2015-07-15 16:44:56 0-0-0-0-3 1 2 0 5 0 0
4 5 ... 13 14 15 16 17 18 19 20 21 22
0 0 0 ... 0 0 0 0 0 NaN NaN NaN NaN NaN
1 0 0 ... 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 0 0 ... 4 0 0 0 NaN NaN NaN NaN NaN NaN
3 0 0 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
[4 rows x 27 columns]
Or you can clean the file before passing it to pandas:
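import pandas as pd
from tempfile import TemporaryFile

with open("in.csv") as f, TemporaryFile("w+") as t:
    # Copy the file with all spaces removed (this also removes the
    # space inside the timestamps).
    for line in f:
        t.write(line.replace(" ", ""))
    t.seek(0)
    # Count the fields in the last line read.
    ln = len(line.strip().split(","))
    # Consume the header line, then pad it with integer names.
    header = t.readline().strip().split(",")
    header += range(ln)
    print(pd.read_csv(t, names=header))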
Which gives you:
SAMPLE_TIME POS OFF HISTOGRAM 0 1 2 3 4 5 ... 11 \
0 2015-07-1516:41:56 0-0-0-0-3 1 2 0 5 59 0 0 0 ... 0
1 2015-07-1516:42:55 0-0-0-0-3 1 0 0 5 9 0 0 0 ... 0
2 2015-07-1516:43:55 0-0-0-0-3 1 0 0 5 5 0 0 0 ... 0
3 2015-07-1516:44:56 0-0-0-0-3 1 2 0 5 0 0 0 0 ... 0
12 13 14 15 16 17 18 19 20
0 0 0 0 0 0 0 NaN NaN NaN
1 50 0 NaN NaN NaN NaN NaN NaN NaN
2 0 4 0 0 0 NaN NaN NaN NaN
3 6 0 0 0 0 NaN NaN NaN NaN
[4 rows x 25 columns]
Or drop the columns that are entirely NaN:
print(pd.read_csv(f, names=header).dropna(axis=1, how="all"))
Which gives you:
SAMPLE_TIME POS OFF HISTOGRAM 0 1 2 3 \
0 2015-07-15 16:41:56 0-0-0-0-3 1 2 0 5 59 0
1 2015-07-15 16:42:55 0-0-0-0-3 1 0 0 5 9 0
2 2015-07-15 16:43:55 0-0-0-0-3 1 0 0 5 5 0
3 2015-07-15 16:44:56 0-0-0-0-3 1 2 0 5 0 0
4 5 ... 8 9 10 11 12 13 14 15 16 17
0 0 0 ... 2 0 0 0 0 0 0 0 0 0
1 0 0 ... 2 0 0 0 50 0 NaN NaN NaN NaN
2 0 0 ... 2 0 0 0 0 4 0 0 0 NaN
3 0 0 ... 2 0 0 0 6 0 0 0 0 NaN
[4 rows x 22 columns]
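As a variation on the approach above, the sketch below sizes the header from the widest row in the whole file instead of the first or last row, so no row can have more fields than there are names. The helper name `read_ragged_csv` and the shortened sample text are made up for illustration, and the plain `split` assumes no field contains a quoted comma.

```python
import io

import pandas as pd


def read_ragged_csv(source, sep=","):
    """Read CSV text whose data rows have more fields than the header."""
    lines = source.read().splitlines()
    header = [name.strip() for name in lines[0].split(sep)]
    # Widest data row decides how many column names are needed.
    width = max(len(line.split(sep)) for line in lines[1:])
    # Name the extra columns 0, 1, 2, ... after the real header names.
    names = header + list(range(width - len(header)))
    body = io.StringIO("\n".join(lines[1:]))
    return pd.read_csv(body, names=names, skipinitialspace=True)


# Shortened, hypothetical version of the sample file.
sample = io.StringIO(
    "SAMPLE_TIME, POS, OFF, HISTOGRAM\n"
    "2015-07-15 16:41:56, 0-0-0-0-3, 1, 2,0,5,\n"
    "2015-07-15 16:42:55, 0-0-0-0-3, 1, 0,0\n"
)
df = read_ragged_csv(sample)
print(df)
```

Rows shorter than the widest one are padded with NaN by `read_csv`, and a trailing comma produces one all-NaN column that `dropna(axis=1, how="all")` can remove.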
【Answer 2】: You can split the HISTOGRAM column into a new DataFrame and concat it to the original.
print(df)
SAMPLE_TIME, POS, OFF, \
0 2015-07-15 16:41:56 0-0-0-0-3, 1,
1 2015-07-15 16:42:55 0-0-0-0-3, 1,
2 2015-07-15 16:43:55 0-0-0-0-3, 1,
3 2015-07-15 16:44:56 0-0-0-0-3, 1,
HISTOGRAM
0 2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,
1 0,0,5,9,0,0,0,0,0,2,0,0,0,50,0,
2 0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0,
3 2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0
# create a new DataFrame from the HISTOGRAM column
h = pd.DataFrame([x.split(',') for x in df['HISTOGRAM'].tolist()])
print(h)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 2 0 5 59 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0
1 0 0 5 9 0 0 0 0 0 2 0 0 0 50 0 None None None None
2 0 0 5 5 0 0 0 0 0 2 0 0 0 0 4 0 0 0 None
3 2 0 5 0 0 0 0 0 0 2 0 0 0 6 0 0 0 0 None None
# append to the original, rename column 0
df = pd.concat([df, h], axis=1).rename(columns={0: 'HISTOGRAM'})
print(df)
HISTOGRAM HISTOGRAM 1 2 3 4 5 ... 10 \
0 2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0, 2 0 5 59 0 0 ... 0
1 0,0,5,9,0,0,0,0,0,2,0,0,0,50,0, 0 0 5 9 0 0 ... 0
2 0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0, 0 0 5 5 0 0 ... 0
3 2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0 2 0 5 0 0 0 ... 0
11 12 13 14 15 16 17 18 19
0 0 0 0 0 0 0 0 0
1 0 0 50 0 None None None None
2 0 0 0 4 0 0 0 None
3 0 0 6 0 0 0 0 None None
[4 rows x 24 columns]
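A minimal sketch of the same idea using the pandas string accessor, assuming HISTOGRAM has already been read as a single string column. The two-row frame is made up for illustration. Unlike the version above, it strips the trailing comma, converts the pieces to numbers, and drops the original string column rather than keeping two columns named HISTOGRAM.

```python
import pandas as pd

# Hypothetical two-row frame with HISTOGRAM held as one string per row.
df = pd.DataFrame({
    "SAMPLE_TIME": ["2015-07-15 16:41:56", "2015-07-15 16:42:55"],
    "HISTOGRAM": ["2,0,5,59,", "0,0,5"],
})

# Drop the trailing comma, then split into one column per value.
h = df["HISTOGRAM"].str.rstrip(",").str.split(",", expand=True)
# Convert the string pieces to numbers; missing cells become NaN.
h = h.apply(pd.to_numeric)

# Replace the original string column with the expanded columns.
out = pd.concat([df.drop(columns="HISTOGRAM"), h], axis=1)
print(out)
```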
【Answer 3】: Assuming your data is in a file called foo.csv, you can do the following. This was tested against Pandas 0.17.
df = pd.read_csv('foo.csv', names=['sample_time', 'pos', 'off', 'histogram', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17'], skiprows=1)
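The long list of names does not have to be typed out by hand. The sketch below builds the same 21 names programmatically and reads a one-row, made-up sample from memory instead of foo.csv. It assumes, as the hard-coded list does, that no row has more than 21 fields.

```python
import io

import pandas as pd

# Four real names followed by '1' .. '17', matching the hard-coded list.
names = ["sample_time", "pos", "off", "histogram"] + [str(i) for i in range(1, 18)]

# Hypothetical one-row stand-in for foo.csv.
sample = io.StringIO(
    "SAMPLE_TIME, POS, OFF, HISTOGRAM\n"
    "2015-07-15 16:44:56, 0-0-0-0-3, 1, 2,0,5\n"
)

# skiprows=1 discards the incomplete header line from the file.
df = pd.read_csv(sample, names=names, skiprows=1)
print(df.shape)
```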
【Answer 4】: How about this. I made a csv from your example data.
When I import the rows:
import csv

# On Python 2, open the file with mode 'rb' instead.
with open('test.csv', newline='') as f:
    lines = list(csv.reader(f))
headers, values = lines[0], lines[1:]
To generate nice header names, use the following line:
headers = [i or ind for ind, i in enumerate(headers)]
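Because of the way (I assume) csv works, the header should contain a bunch of empty-string values. An empty string evaluates to False, so this comprehension returns the column number for every column that has no header.
Then just make a df: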
df = pd.DataFrame(values,columns=headers)
Which looks like:
11: SAMPLE_TIME POS OFF HISTOGRAM 4 5 6 7 8 9 \
0 15/07/2015 16:41 0-0-0-0-3 1 2 0 5 59 0 0 0
1 15/07/2015 16:42 0-0-0-0-3 1 0 0 5 9 0 0 0
2 15/07/2015 16:43 0-0-0-0-3 1 0 0 5 5 0 0 0
3 15/07/2015 16:44 0-0-0-0-3 1 2 0 5 0 0 0 0
... 12 13 14 15 16 17 18 19 20 21
0 ... 2 0 0 0 0 0 0 0 0 0
1 ... 2 0 0 0 50 0
2 ... 2 0 0 0 0 4 0 0 0
3 ... 2 0 0 0 6 0 0 0 0
[4 rows x 22 columns]
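The header comprehension above relies on the header row already containing one (empty) entry per column. For a file like the one in the question, where the header line simply stops after four names, a sketch along the following lines pads both the header and the short rows before building the frame. The in-memory sample is a shortened, made-up stand-in for test.csv, and all values stay strings because csv.reader does no type conversion.

```python
import csv
import io

import pandas as pd

# Shortened, hypothetical stand-in for test.csv.
sample = io.StringIO(
    "SAMPLE_TIME, POS, OFF, HISTOGRAM\n"
    "2015-07-15 16:41:56, 0-0-0-0-3, 1, 2,0,5,59\n"
    "2015-07-15 16:42:55, 0-0-0-0-3, 1, 0,0\n"
)
lines = list(csv.reader(sample, skipinitialspace=True))
headers, values = lines[0], lines[1:]

# Extend the header with positional names up to the longest row.
width = max(len(row) for row in values)
headers = headers + list(range(len(headers), width))

# Pad short rows with None so every row has the same length.
values = [row + [None] * (width - len(row)) for row in values]

df = pd.DataFrame(values, columns=headers)
print(df)
```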
【Comments】:
Python 2.7.10 on Windows 7, Anaconda 2.1.0 64-bit. Pandas 0.17.1, csv 1.0. I don't understand your doubt. gist.github.com/gregroberts/a6e6040c045ea9130fee
So the input has all of those values in a single cell. I see my mistake.
Yes, the first example is the input; it has a whole bunch of problems.