Pandas - chunk read_csv with overlap between chunks
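【Title】: Pandas - chunk read_csv with overlap between chunks
【Posted】: 2020-08-13 20:09:23
【Question】:
Problem statement
How can I read a csv file in chunks with pandas, with an overlap between consecutive chunks?
For example, suppose the list indexes represents the index of some dataframe I wish to read in.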
indexes = [0,1,2,3,4,5,6,7,8,9]
read_csv(filename, chunksize=None):
indexes = [0,1,2,3,4,5,6,7,8,9] # read in all indexes at once
read_csv(filename, chunksize=5):
indexes = [[0,1,2,3,4], [5,6,7,8,9]] # iteratively read in mutually exclusive index sets
read_csv(filename, chunksize=5, overlap=2):
indexes = [[0,1,2,3,4], [3,4,5,6,7], [6,7,8,9]] # iteratively read in indexes sets with overlap size 2
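The overlap=2 call above is the behaviour being asked for, not an existing read_csv parameter. To pin down which rows each chunk should contain, here is a minimal sketch in plain Python. The helper name overlapping_windows is hypothetical and used only for illustration; it assumes each chunk starts chunksize - overlap rows after the previous one and that a trailing chunk consisting only of already-seen rows is not wanted.

```python
def overlapping_windows(n_rows, chunksize, overlap):
    """Return the row positions of each chunk (hypothetical helper, not a pandas API)."""
    if not 0 <= overlap < chunksize:
        raise ValueError("overlap must satisfy 0 <= overlap < chunksize")
    if n_rows <= 0:
        return []
    step = chunksize - overlap  # distance between consecutive chunk starts
    # Stop at n_rows - overlap so the last chunk always contains at least one new row.
    starts = range(0, max(n_rows - overlap, 1), step)
    return [list(range(s, min(s + chunksize, n_rows))) for s in starts]


windows = overlapping_windows(10, chunksize=5, overlap=2)
print(windows)
```

For 10 rows, a chunk size of 5 and an overlap of 2, this reproduces the three index sets listed in the question.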
Working solution
I have a hack solution using skiprows and nrows, but it gets progressively slower as it works its way through the csv file.
import pandas as pd

indexes = [*range(10)]
chunksize = 5
overlap_count = 2
row_count = len(indexes)  # this I can work out before reading the whole file in rather cheaply
step = chunksize - overlap_count
# start row of each chunk; stopping at row_count - overlap_count avoids a final chunk made only of overlap rows
chunk_starts = range(0, row_count - overlap_count, step)
for start in chunk_starts:
    # skip the data rows before this chunk, but keep line 0 (assumed to be the header)
    skiprows = [*range(1, start + 1)]
    chunk = pd.read_csv(filename, skiprows=skiprows, nrows=chunksize)
Does anyone have any insight into this problem, or an improved solution?
【Answer 1】: I think you should pass a number to skiprows rather than a list. Try:
# row_count, chunksize and overlap_count are the variables defined in the question
for i in range(0, row_count - overlap_count, chunksize - overlap_count):
    print(pd.read_csv('test.csv',
                      skiprows=i + 1,   # +1 because the first row is the header
                      nrows=chunksize,
                      index_col=0,      # this is how I saved my csv
                      header=None)      # you may need to read the header separately first
          .index)
Int64Index([0, 1, 2, 3, 4], dtype='int64', name=0)
Int64Index([3, 4, 5, 6, 7], dtype='int64', name=0)
Int64Index([6, 7, 8, 9], dtype='int64', name=0)
【Comments】:
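Thank you, but unfortunately passing skiprows as an integer does not fix the problem of the read getting progressively slower in later iterations. Here are some test results: iteration #0, time: 0.0039s; iteration #100, time: 0.0208s; iteration #200, time: 0.0403s; iteration #300, time: 0.0597s
@Andy Interesting! Out of curiosity, when you use read_csv's chunksize parameter, there is no difference in time between iterations, right?
It turns out there is a difference! read_csv with the chunksize parameter takes roughly constant time per iteration.
@Andy Yes, that is what I meant, please forgive my English :) Have you tried the parameter engine='c'? The docs say it is faster. It won't change the difference between iterations, but it may be faster overall.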
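Building on the observation in the comments that a chunked read_csv takes roughly constant time per iteration, one possible way to avoid the slowdown is to open the file once and carry the last overlap rows of each chunk into the next one, instead of calling read_csv again with a growing skiprows. The following is a minimal sketch under stated assumptions, not a tested production solution: read_csv_overlapping is a hypothetical helper name, it relies on iterator=True and TextFileReader.get_chunk from pandas, and it assumes get_chunk raises StopIteration once the file is exhausted.

```python
import pandas as pd


def read_csv_overlapping(path, chunksize, overlap, **kwargs):
    """Yield DataFrames of up to `chunksize` rows that share `overlap` rows
    with the previous one (hypothetical helper, not a pandas API)."""
    if not 0 <= overlap < chunksize:
        raise ValueError("overlap must satisfy 0 <= overlap < chunksize")
    step = chunksize - overlap  # number of new rows read per iteration
    reader = pd.read_csv(path, iterator=True, **kwargs)
    try:
        try:
            window = reader.get_chunk(chunksize)
        except StopIteration:
            return
        if window.empty:
            return
        yield window
        # A short window means the previous read already hit the end of the file.
        while len(window) == chunksize:
            try:
                fresh = reader.get_chunk(step)
            except StopIteration:
                return
            # Keep the last `overlap` rows and append the newly read rows.
            window = pd.concat([window.iloc[step:], fresh])
            yield window
    finally:
        reader.close()
```

It would be used like `for chunk in read_csv_overlapping(filename, chunksize=5, overlap=2): ...`. Because the file is parsed only once from start to finish, the per-chunk cost should no longer grow with the position in the file, at the price of one small concat per chunk.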