Pandas read_csv and UTF-16


【Title】: Pandas read_csv and UTF-16 【Posted】: 2012-11-21 07:47:10 【Question】:

I have a CSV text file encoded in UTF-16 (to preserve Unicode characters when other people open it in Excel), but when I call read_csv with Pandas 0.9.0, I get this cryptic error:

df = pd.read_csv('data.txt',encoding='utf-16',sep='\t',header=0)
df.head()

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-18-85da1383cd9e> in <module>()
----> 1 df = pd.read_csv('candidates-spanish.txt',encoding='utf-16',sep='\t',header=0)
  2 df.head()

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in read_csv(filepath_or_buffer, sep, dialect, header, index_col, names, skiprows, na_values, keep_default_na, thousands, comment, parse_dates, keep_date_col, dayfirst, date_parser, nrows, iterator, chunksize, skip_footer, converters, verbose, delimiter, encoding, squeeze, **kwds)
248         kdict['delimiter'] = sep
249 
--> 250     return _read(TextParser, filepath_or_buffer, kdict)
251 
252 @Appender(_read_table_doc)

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(cls, filepath_or_buffer, kwds)
198         return parser
199 
--> 200     return parser.get_chunk()
201 
202 @Appender(_read_csv_doc)

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in get_chunk(self, rows)
853         elif not self._has_complex_date_col:
854             index = self._get_simple_index(alldata, columns)
--> 855             index = self._agg_index(index)
856 
857         elif self._has_complex_date_col:

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in _agg_index(self, index, try_parse_dates)
980                 arr, _ = _convert_types(arr, col_na_values)
981                 arrays.append(arr)
--> 982             index = MultiIndex.from_arrays(arrays, names=self.index_name)
983         return index
984 

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/index.pyc in from_arrays(cls, arrays, sortorder, names)
1570 
1571         return MultiIndex(levels=levels, labels=labels,
-> 1572                           sortorder=sortorder, names=names)
1573 
1574     @classmethod

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/index.pyc in __new__(cls, levels, labels, sortorder, names)
1254         assert(len(levels) == len(labels))
1255         if len(levels) == 0:
-> 1256             raise Exception('Must pass non-zero number of levels/labels')
1257 
1258         if len(levels) == 1:

Exception: Must pass non-zero number of levels/labels

Reading the data line by line with csv.reader, based on this example, suggests that my data is not malformed:

from io import BytesIO
import csv

with open('data.txt','rb') as f:
    r = f.read().decode('utf-16').encode('utf-8')
    for l in csv.reader(BytesIO(r),delimiter='\t'):
        print l

['Country', 'State/City', 'Title', 'Date', 'Catalogue', 'Wikipedia Election Page', 'Wikipedia Individual Page', 'Electoral Institution in Country', 'Twitter', 'CANDIDATE NAME 1', 'CANDIDATE NAME 2']
['Venezuela', 'N/A', 'President', '10/7/12', 'Hugo Rafael Chavez Frias', 'Hugo Ch\xc3\xa1vez', 'Hugo Ch\xc3\xa1vez', 'Hugo Chavez', 'Hugo Ch\xc3\xa1vez Fr\xc3\xadas', 'Hugo Chavez', 'Hugo Ch\xc3\xa1vez']
['Venezuela', 'N/A', 'President', '10/7/12', 'Henrique Capriles Radonski', 'Henrique Capriles Radonski', 'Henrique Capriles Radonski', 'Henrique Capriles Radonski', 'Henrique Capriles R.', 'Henrique Capriles', '']
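For readers on Python 3, a rough equivalent of the check above might look like the following. This is only a sketch: the `raw` bytes are hypothetical sample data standing in for the contents of data.txt, and in Python 3 csv.reader expects text rather than UTF-8 bytes, so the re-encoding step is dropped.

```python
import csv
import io

# Hypothetical sample bytes standing in for the contents of data.txt.
# Encoding with "utf-16" prepends a byte order mark, as Excel-style exports usually do.
raw = "Country\tTitle\tName\nVenezuela\tPresident\tHugo Chávez\n".encode("utf-16")

# Decode UTF-16 to text; the codec consumes the byte order mark.
text = raw.decode("utf-16")

# csv.reader in Python 3 iterates over str lines, so wrap the text in StringIO.
rows = list(csv.reader(io.StringIO(text, newline=""), delimiter="\t"))

for row in rows:
    print(row)
```

If every row comes back with the same number of fields as the header, the file itself is probably fine and the problem lies in how it is being parsed.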

Is there some preprocessing, an additional option to read_csv, or something else that needs to be done before pandas.read_csv can read a UTF-16 file? Thanks!

【Comments on the question】:

Can you post or email a version of the text file? I'll take a look.

brianckeegan.com/data/candidates-spanish.txt

【Answer 1】:
from StringIO import StringIO
import pandas as pd

a = ['Venezuela', 'N/A', 'President', '10/7/12', 'Hugo Rafael Chavez Frias', 'Hugo Ch\xc3\xa1vez', 'Hugo Ch\xc3\xa1vez', 'Hugo Chavez', 'Hugo Ch\xc3\xa1vez Fr\xc3\xadas', 'Hugo Chavez', 'Hugo Ch\xc3\xa1vez']

pd.read_csv(StringIO('\t'.join(a)), delimiter='\t')

This works here. Could you upload the head of your data so that I can test it?

【Comments】:

【Answer 2】:

This is a bug, and I think it happens because the csv reader passes back an extra empty line at the very beginning. It works for me on Python 2.7.3 and pandas 0.9.1 if I do this:

In [36]: pd.read_csv(BytesIO(fh.read().decode('UTF-16').encode('UTF-8')), sep='\t', header=0)
Out[36]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50 entries, 0 to 49
Data columns:
Country                             43  non-null values
State/City                          43  non-null values
Title                               43  non-null values
Date                                43  non-null values
Catalogue                           43  non-null values
Wikipedia Election Page             43  non-null values
Wikipedia Individual Page           43  non-null values
Electoral Institution in Country    43  non-null values
Twitter                             43  non-null values
CANDIDATE NAME 1                    43  non-null values
CANDIDATE NAME 2                    16  non-null values
dtypes: object(11)

I reported the bug here: https://github.com/pydata/pandas/issues/2418. Unfortunately, on GitHub master it causes a segfault in the C parser. We'll fix it.

Now, for fun: https://softwareengineering.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful ;)

【Comments】:

If Excel played well with UTF-8, I would use it! :)

【Answer 3】:

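Python 3:

import pandas as pd

# Open the file as UTF-16 text and hand the file object to pandas
with open('data.txt', encoding='UTF-16') as f:
    df = pd.read_csv(f)

【Comments】:

You can now pass encoding='UTF-16' directly to pd.read_csv(), so open() is no longer needed.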
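A minimal sketch of that direct approach, assuming a recent pandas on Python 3. The sample rows and the file name sample-utf16.txt are made up for illustration, and sep='\t' is included because the file in the question is tab-separated.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data standing in for the asker's tab-separated file.
sample = "Country\tTitle\tName\nVenezuela\tPresident\tHugo Chávez\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample-utf16.txt")

    # Writing with the "utf-16" codec adds a byte order mark at the start of the file.
    with open(path, "w", encoding="utf-16", newline="") as f:
        f.write(sample)

    # Pass the encoding and the tab separator straight to read_csv; no open() needed.
    df = pd.read_csv(path, encoding="utf-16", sep="\t")

print(df)
```

Without sep='\t', read_csv would split on commas and each line would probably land in a single column.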
