Pandas - chunk read_csv with overlap between chunks

Posted 2023-03-11


【Published】: 2020-08-13 20:09:23 【Question】:

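Problem statement

How can I read a csv file in chunks with pandas, with an overlap between consecutive chunks?

For example, suppose the list indexes represents the index of some dataframe I wish to read in.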

indexes = [0,1,2,3,4,5,6,7,8,9]

read_csv(filename, chunksize=None):

indexes = [0,1,2,3,4,5,6,7,8,9]  # read in all indexes at once

read_csv(filename, chunksize=5):

indexes = [[0,1,2,3,4], [5,6,7,8,9]]  # iteratively read in mutually exclusive index sets

read_csv(filename, chunksize=5, overlap=2):

indexes = [[0,1,2,3,4], [3,4,5,6,7], [6,7,8,9]]  # iteratively read in indexes sets with overlap size 2
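Note that read_csv has no overlap parameter; the third call above is the behaviour being asked for. To pin down exactly which rows each chunk should contain, here is a minimal pure-Python sketch. The function name overlapping_windows and its arguments are hypothetical helpers for illustration, not part of pandas.

```python
def overlapping_windows(row_count, chunksize, overlap):
    """Return lists of row positions; each list shares `overlap` rows with the previous one."""
    if not 0 <= overlap < chunksize:
        raise ValueError("overlap must be non-negative and smaller than chunksize")
    if row_count <= 0:
        return []
    step = chunksize - overlap  # how far the window start advances each time
    windows = []
    start = 0
    while True:
        stop = min(start + chunksize, row_count)  # the last window may be shorter
        windows.append(list(range(start, stop)))
        if stop >= row_count:  # every row has been covered
            break
        start += step
    return windows


print(overlapping_windows(10, 5, 2))
```

With row_count=10, chunksize=5 and overlap=2 this yields the three index sets listed above, with a shorter final chunk.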

Working solution

I have a hack solution using skiprows and nrows, but it gets progressively slower the further into the csv file it reads.

import pandas as pd

filename = 'test.csv'  # placeholder path (assumed)
indexes = [*range(10)]
chunksize = 5
overlap_count = 2
row_count = len(indexes)  # this I can work out before reading the whole file in rather cheaply

# (start, stop) row positions of each chunk; stop may run past the last row, nrows copes with that
chunked_indexes = [(i, i + chunksize) for i in range(0, row_count - overlap_count, chunksize - overlap_count)]
for start, stop in chunked_indexes:
    # skip the data rows before this chunk; file line 0 is the header and must be kept
    skiprows = [*range(1, start + 1)]
    chunk = pd.read_csv(filename, skiprows=skiprows, nrows=chunksize)

Does anyone have any insight into this problem, or an improved solution?

【Question comments】:

【Answer 1】:

I think you should pass a number to skiprows rather than a list. Try:

for i in range(0, row_count - overlap_count, chunksize - overlap_count):
    print(pd.read_csv('test.csv',
                      skiprows=i + 1,  # +1 because the first row is the header
                      nrows=chunksize,
                      index_col=0,     # this is how I saved my csv
                      header=None)     # you may need to read the header beforehand
            .index)

Output:

Int64Index([0, 1, 2, 3, 4], dtype='int64', name=0)
Int64Index([3, 4, 5, 6, 7], dtype='int64', name=0)
Int64Index([6, 7, 8, 9], dtype='int64', name=0)
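Because the snippet above passes header=None, the column names are lost. One possible way to follow the "read the header beforehand" hint is sketched below: read zero data rows once to get the names, then hand them to each windowed read through names. The helpers read_header and read_window are hypothetical names for illustration, and this does not address the slowdown at larger offsets.

```python
import io

import pandas as pd


def read_header(source):
    """Read only the header line and return the column names as a list."""
    return list(pd.read_csv(source, nrows=0).columns)


def read_window(source, start, nrows, columns):
    """Read `nrows` data rows starting at data row `start`, labelled with `columns`."""
    return pd.read_csv(
        source,
        skiprows=start + 1,  # +1 skips the header line as well
        nrows=nrows,
        header=None,         # the header line was skipped, so supply names explicitly
        names=columns,
    )


# Small in-memory file so the sketch runs on its own (assumed data)
csv_text = "a,b\n" + "".join(f"{i},{i * 10}\n" for i in range(10))
columns = read_header(io.StringIO(csv_text))
window = read_window(io.StringIO(csv_text), start=3, nrows=5, columns=columns)
print(window)
```

With a real file, pass the path instead of the io.StringIO objects.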

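【Comments】:

Thank you, but unfortunately passing skiprows as an integer does not solve the problem of reads getting progressively slower at higher iterations. Here are some test results: iteration #0, time: 0.0039s; iteration #100, time: 0.0208s; iteration #200, time: 0.0403s; iteration #300, time: 0.0597s

@Andy Interesting! Out of curiosity, there is no difference in time when using the chunksize parameter of read_csv, right?

It turns out there is a difference! read_csv with the chunksize parameter takes roughly constant time per iteration.

@Andy Yes, that is what I meant, pardon my English :) Have you tried the parameter engine='c'? The docs say it is faster. It will not change the difference between iterations, but it may be faster overall.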
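Building on the observation that read_csv with chunksize takes roughly constant time per iteration, one possible approach is to read the file once in blocks of chunksize - overlap rows and carry the tail of each chunk over into the next one. This is a minimal sketch rather than a tested recipe from the thread; the generator name read_csv_overlapping is hypothetical, and only pd.read_csv(chunksize=...) and pd.concat are assumed from pandas.

```python
import io

import pandas as pd


def read_csv_overlapping(source, chunksize, overlap, **kwargs):
    """Yield DataFrames of up to `chunksize` rows; consecutive ones share `overlap` rows."""
    if not 0 <= overlap < chunksize:
        raise ValueError("overlap must be non-negative and smaller than chunksize")
    step = chunksize - overlap
    buffer = None
    yielded = False
    # Single pass over the file, reading `step` new rows at a time
    for block in pd.read_csv(source, chunksize=step, **kwargs):
        buffer = block if buffer is None else pd.concat([buffer, block])
        while len(buffer) >= chunksize:
            yield buffer.iloc[:chunksize]
            yielded = True
            buffer = buffer.iloc[step:]  # keep the last `overlap` rows for the next chunk
    # Emit a final, shorter chunk only if it holds rows that were not yielded yet
    if buffer is not None and (len(buffer) > overlap or (not yielded and len(buffer) > 0)):
        yield buffer


# Small in-memory file so the sketch runs on its own (assumed data)
csv_text = "value\n" + "".join(f"{i}\n" for i in range(10))
for chunk in read_csv_overlapping(io.StringIO(csv_text), chunksize=5, overlap=2):
    print(chunk["value"].tolist())
```

The file is parsed only once, so the cost per chunk should not grow with the position in the file; the trade-off is a small pd.concat on every block.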
