df.to_csv structuring the output

Posted: 2016-11-23 14:50:27

Question:

I am trying to write the output to a CSV file, but I am getting a different format.

What do I need to change in order to get clean output?

Code:

import pandas as pd 
from datetime import datetime
import csv

df = pd.read_csv('one_hour.csv')
df.columns = ['date', 'startTime', 'endTime', 'day', 'count', 'unique']

count_med = df.groupby(['date'])[['count']].median()
unique_med = df.groupby(['date'])[['unique']].median()
date_count = df['date'].nunique()
#print count_med
#print unique_med

cols = ['date_count', 'count_med', 'unique_med']
outf = pd.DataFrame([[date_count, count_med, unique_med]], columns = cols)
outf.to_csv('date_med.csv', index=False, header=False)

Input: just a few rows from a large data file.

2004-01-05,21:00:00,22:00:00,Mon,16553,783
2004-01-05,22:00:00,23:00:00,Mon,18944,790
2004-01-05,23:00:00,00:00:00,Mon,17534,750
2004-01-06,00:00:00,01:00:00,Tue,17262,747
2004-01-06,01:00:00,02:00:00,Tue,19072,777
2004-01-06,02:00:00,03:00:00,Tue,18275,785
2004-01-06,03:00:00,04:00:00,Tue,13589,757
2004-01-06,04:00:00,05:00:00,Tue,16053,735
2004-01-06,05:00:00,06:00:00,Tue,11440,636

Output:

63,"              count
date               
2004-01-05  10766.0
2004-01-06  11530.0
2004-01-07  11270.0
2004-01-08  14819.5
2004-01-09  12933.5
2004-01-10  10088.0
2004-01-11  10923.0
2004-02-03  14760.5
...             ...
2004-02-07  10131.5
2004-02-08  11184.0

[63 rows x 1 columns]","            unique
date              
2004-01-05   633.0
2004-01-06   741.0
2004-01-07   752.5
2004-02-03   779.5
...            ...
2004-02-07   643.5

[63 rows x 1 columns]"

But this is not what the output is expected to look like.

Expected output: the rounded values along with the date:

2004-01-05,10766,633 
2004-01-06,11530,741
2004-01-07,11270,752

Comments on the question:

Please post a sample of the input dataset and the desired output dataset in raw CSV format.

Answer 1:

Try this:

cols = ['date', 'startTime', 'endTime', 'day', 'count', 'unique']

df = pd.read_csv(fn, header=None, names=cols)

df.groupby(['date'])[['count','unique']].agg({'count':'median','unique':'median'}).round().to_csv('d:/temp/out.csv', header=None)

out.csv:

2004-01-05,764,17044.0
2004-01-06,757,17262.0
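
Note that the two value columns in out.csv above come out in the opposite order to the expected output (the unique median before the count median). When a dict is passed to .agg, older pandas versions do not guarantee the order of the resulting columns, so if the order matters it is safer to select the columns explicitly before writing. A minimal sketch of that, reusing the df from this answer (the explicit reordering step is an addition, not part of the original answer):

med = df.groupby('date').agg({'count': 'median', 'unique': 'median'}).round()
# force the column order to date,count,unique before writing
med[['count', 'unique']].to_csv('out.csv', header=False)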

Comments:

I need the values rounded.

@SitzBlogz, I added .round()

Great, that works, but I still get float values :( Can I change them to integers?

Yes, and I checked the stack points, you have fewer than he does, so I picked your answer. Thanks again.. If you happen to have a little time, some help with this one would also be great.. ***.com/questions/38344487/…

Answer 2:

You need:

import pandas as pd
import io

temp=u"""2004-01-05,21:00:00,22:00:00,Mon,16553,783
2004-01-05,22:00:00,23:00:00,Mon,18944,790
2004-01-05,23:00:00,00:00:00,Mon,17534,750
2004-01-06,00:00:00,01:00:00,Tue,17262,747
2004-01-06,01:00:00,02:00:00,Tue,19072,777
2004-01-06,02:00:00,03:00:00,Tue,18275,785
2004-01-06,03:00:00,04:00:00,Tue,13589,757
2004-01-06,04:00:00,05:00:00,Tue,16053,735
2004-01-06,05:00:00,06:00:00,Tue,11440,636"""
#after testing, replace io.StringIO(temp) with the real filename
df = pd.read_csv(io.StringIO(temp), parse_dates=[0], names=['date', 'startTime', 'endTime', 'day', 'count', 'unique'])
print (df)

outf = df.groupby('date')['count', 'unique'].median().round().astype(int)
print (outf)
            count  unique
date                     
2004-01-05  17534     783
2004-01-06  16658     752


outf.to_csv('date_med.csv', header=False)
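
With header=False, date_med.csv then holds one line per date in the form date,count,unique, e.g. 2004-01-05,17534,783 for the sample data. One caveat: selecting the groupby columns with a bare tuple, as in ['count', 'unique'] above, is accepted by the pandas versions current at the time of the answer but is rejected by newer pandas releases; the same result can be obtained with a list selection (a small sketch, otherwise identical to the answer):

outf = df.groupby('date')[['count', 'unique']].median().round().astype(int)
outf.to_csv('date_med.csv', header=False)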

Timings:

In [20]: %timeit df.groupby('date')['count', 'unique'].median().round().astype(int)
The slowest run took 4.47 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.67 ms per loop

In [21]: %timeit df.groupby(['date'])[['count','unique']].agg({'count':'median','unique':'median'}).round().astype(int)
The slowest run took 4.44 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 3.64 ms per loop

Comments:

I get this error: Traceback (most recent call last): File "date_median.py", line 15, in <module> outf = pd.DataFrame({'date_count': date_count, 'count_med': count_med, 'unique_med': unique_med}).reset_index() File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 5231, in _arrays_to_mgr index = extract_index(arrays) File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 5270, in extract_index raise ValueError('If using all scalar values, you must pass' ValueError: If using all scalar values, you must pass an index

Sorry, I have edited the answer. It works fine now, and it also converts the floats to int.

That is up to you. ;)

Thank you very much. @MaxU's stack points are low, so let me help him get more.

OK, no problem. It is just that my solution is faster, so you can use it.
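
The ValueError quoted in the first comment is pandas' standard complaint when a DataFrame is constructed from scalar values only. A common fix (shown here as a sketch with made-up scalar values, not taken from the original answer) is to pass an explicit index, or to wrap each scalar in a one-element list:

import pandas as pd

date_count, count_med, unique_med = 63, 10766.0, 633.0  # hypothetical scalars
# option 1: give the single row an explicit index
outf = pd.DataFrame({'date_count': date_count, 'count_med': count_med, 'unique_med': unique_med}, index=[0])
# option 2: wrap each scalar in a one-element list
outf = pd.DataFrame({'date_count': [date_count], 'count_med': [count_med], 'unique_med': [unique_med]})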
