How to read a 1.3 GB CSV file with text information into a Python pandas object?
【Posted】: 2017-12-18 13:06:15
【Question】: I am trying to use 'pd.read_csv' to read a 1.3 GB CSV file with two columns and 19,333 rows into a pandas DataFrame, but it keeps failing with the error 'CParserError: Error tokenizing data. C error: out of memory'. I have tried many of the suggestions posted online, such as using 'chunksize', but they do not seem to work and only produce 'Kernel died, restarting'. Here is the output from running 'pd.read_csv':
import pandas as pd
import numpy as np
import os
os.chdir("/home/swhan/Downloads")
CORPUS = pd.read_csv('10k_2005_2008_file.csv')
Traceback (most recent call last):
File "<ipython-input-1-8136c4f0354a>", line 7, in <module>
CORPUS = pd.read_csv('10k_2005_2008_file.csv')
File "/home/swhan/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 646, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/swhan/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 401, in _read
data = parser.read()
File "/home/swhan/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 939, in read
ret = self._engine.read(nrows)
File "/home/swhan/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1508, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 848, in pandas.parser.TextReader.read (pandas/parser.c:10415)
File "pandas/parser.pyx", line 870, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10691)
File "pandas/parser.pyx", line 924, in pandas.parser.TextReader._read_rows (pandas/parser.c:11437)
File "pandas/parser.pyx", line 911, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:11308)
File "pandas/parser.pyx", line 2024, in pandas.parser.raise_parser_error (pandas/parser.c:27037)
CParserError: Error tokenizing data. C error: out of memory
The CSV file consists of two columns, one for the ID and one for the long text belonging to each ID. A subset of it looks like this:
id text
12 python pandas read data of the form ...
13 how to remove file does not exist error ...
41 pandas unable to find files ...
99 issue with python is not a simple problem ...
Is there a way to read this file into a pandas DataFrame? By the way, my desktop has 32 GB of RAM. Thanks in advance!
Alternative attempt using Python code with 'chunksize':
df = pd.DataFrame()
reader = pd.read_csv("10k_2005_2008_file.csv", chunksize=10**3)
for chunk in reader:
df = pd.concat([df, chunk], ignore_index=True)
df
Out[6]:
ID text
0 255618 ['ITEM1.BUSINESSIn this annual report onForm10...
1 94740 ['Item 1. Business.GeneralCommunity CapitalCor...
2 145200 ['ITEM 1.BUSINESSGeneralCommunityBank Shares o...
3 145201 ['ITEM 1. BUSINESSGeneralCommunity Bank Share...
4 145202 ['Item 1. BusinessGeneralCommunity Bank Shares...
5 145203 ['Item1.BusinessGeneralCommunityBank Shares of...
6 221548 ['Item1.BusinessOverviewTravelzoo Inc. (the Co...
7 121633 ['Item1. BusinessGeneralSterling Financial Cor...
8 172796 ['Item 1. BusinessGeneralWe are a Maryland cor...
9 172797 ['Item 1. BusinessGeneralWe are a Maryland cor...
10 121632 ['Item 1.BusinessGeneralCompanyGrowthProfitabi...
11 28995 ['ITEM 1. Business.(Dollars in millions)We res...
12 28994 ['ITEM 1. Business.GeneralAt December31, 2004,...
13 28997 ['Item1.Business.GeneralService Corporation In...
14 28996 ['ITEM 1. Business.GeneralAt December31, 2004,...
15 118636 ['Item1.BusinessWe are a broadcast company pri...
16 28993 ['ITEM 1. Business.GeneralAt December31, 2004,...
17 101760 ['ITEM1.BUSINESSCorporateProfileCognex Corpora...
18 145752 ['Item 1: Election of Directors; Nomineesfor D...
19 94744 ['ITEM1.BUSINESS.GeneralCommunityCapital Corpo...
20 28999 ['Item1.Business.GeneralService Corporation In...
21 28998 ['Item1.Business.GeneralService Corporation In...
22 1868 ['ITEM1.BUSINESSCompany OverviewWe are a world...
23 269745 ['Item1"BusinessThe CompanyThe 2004 Reorganiza...
24 181343 ['ITEM 1. BUSINESSMKS Instruments, Inc. ("the...
25 220768 ['ITEM1. BUSINESS General The Company Sierr...
26 181345 ['Item1.BusinessMKS Instruments, Inc. (the Com...
27 145750 ['Item1. Business BurlingtonNorthern Santa F...
28 181346 ['Item1.BusinessMKS Instruments, Inc. (the Com...
29 145751 ['Item 1: Election of Directors; Nominees for ...
... ...
19303 26477 ['ITEM1.BUSINESS Precision Castparts Corp. (P...
19304 256145 ['Item1 Business,Item1A Risk Factors, and Item...
19305 222814 ['Item1. Business. General Our company, Rock...
19306 73641 ['ITEM 1. BUSINESSGENERALTexas Regional Bancsh...
19307 66997 ['ITEM 1. BUSINESSOur CompanyWe are a leading ...
19308 66996 ['ITEM 1. BUSINESSOur CompanyWe are a leading ...
19309 66994 ['ITEM1. BUSINESS Our Company We are a leadi...
19310 66993 ['ITEM 1. BUSINESS Our CompanyWe are a leadi...
19311 7929 ['Item1. Business(a)General development of bus...
19312 114251 ['Item1.BusinessGeneralTerra Nitrogen Company,...
19313 114250 ['Item1 BusinessGeneralTerra Nitrogen Company,...
19314 198077 ['Item1. BusinessGeneral DescriptionTeam Finan...
19315 162197 ["ITEM 1. BUSINESSWintrust Financial Corporati...
19316 25524 ['Item 1. BusinessEnvironmental. Contamination...
19317 190015 ['Item 1. Description of Business.GeneralEVCI ...
19318 5634 ['Item 1.BusinessGeneral CDI Corp. (the Compa...
19319 5635 ['Item 1.BusinessGeneral CDI Corp. (the Compa...
19320 190932 ['ITEM 1. BUSINESSORGANIZATION AND GENERAL B...
19321 190933 ['ITEM 1. BUSINESSORGANIZATION AND GENERAL B...
19322 5632 ['Item 1.BusinessGeneral CDI Corp., (the Comp...
19323 5633 ['Item 1.BusinessGeneral CDI Corp. (the Compa...
19324 38349 ['Item 1. BusinessThe CompanyNatures SunshineP...
19325 222816 ['Item1 above.Weoperate on a 52/53 week fiscal...
19326 222815 ['Item1. Business.GeneralOur company, Rockwell...
19327 213793 ['Item1.BusinessTvia,Inc. is a fabless semicon...
19328 8489 ['ITEM1.BusinessCrown Crafts, Inc. (the Compan...
19329 224247 ['Item1.Business GENERAL We are asolutions...
19330 198076 ['Item1. BusinessGeneral DescriptionTeam Finan...
19331 34149 ['Item1. BusinessVF Corporation, organized in ...
19332 34148 ['Item1 in PartI, Items 5, 6, 7, 7A, 8 and 9A ...
[19333 rows x 2 columns]
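【Comments】:
Can you show the output when you use chunksize?
I have edited the post above to include the Python code with 'chunksize'. Thanks!
Use a smaller chunk size. Maybe 1000?
Yes, it seems to work fine when I use a smaller chunk size. Thanks!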
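【Answer 1】:
The pandas docs say:
Note: It is worth noting, however, that concat (and therefore append) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension.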
frames = [ process_your_file(f) for f in files ]
result = pd.concat(frames)
So try this approach:
reader = pd.read_csv("10k_2005_2008_file.csv", chunksize=10**3)
df = pd.concat([x for x in reader], ignore_index=True)
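The pattern above can be tried on a small scale before running it against the real file. The following is a minimal sketch, assuming Python 3 and a synthetic in-memory CSV (the 2,500 generated rows and the names csv_buffer and chunks are hypothetical stand-ins, not part of the original question). It reads the data in chunks of 1,000 rows, collects the chunks in a list, and calls pd.concat only once, then reports how much memory the resulting frame occupies.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real 1.3 GB file: a tiny in-memory CSV
# with the same two-column layout (ID, text).
rows = ["ID,text"] + [
    '{},"long text for row {}"'.format(i, i) for i in range(2500)
]
csv_buffer = io.StringIO("\n".join(rows))

# Read 1,000 rows at a time; each chunk is an ordinary DataFrame.
reader = pd.read_csv(csv_buffer, chunksize=1000)

# Collect the chunks in a list and concatenate once at the end,
# instead of calling pd.concat inside the loop on a growing frame.
chunks = [chunk for chunk in reader]
df = pd.concat(chunks, ignore_index=True)

print(len(chunks))  # number of chunks that were read
print(df.shape)     # (rows, columns) of the combined frame

# deep=True also counts the bytes held by the text strings themselves.
print(df.memory_usage(deep=True).sum())
```

Concatenating inside the loop copies the accumulated frame on every iteration, whereas a single concat over the list copies each chunk once. For the real file, replacing csv_buffer with the file path should be the only change needed, though whether it fits in memory still depends on the machine.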