Does conversion from arrow format to pandas dataframe duplicate data on the heap?
Posted: 2022-01-12 23:06:10

Question: I'm trying to figure out what causes the high memory usage when reading from an arrow file and converting to a pandas dataframe. When I look at the heap, the pandas dataframe appears to be nearly the same size as the numpy arrays. Example heap output using guppy's hpy().heap():
Partition of a set of 351136 objects. Total size = 20112096840 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 121 0 9939601034 49 9939601034 49 numpy.ndarray
1 1 0 9939585700 49 19879186734 99 pandas.core.frame.DataFrame
2 1 0 185786680 1 20064973414 100 pandas.core.indexes.datetimes.DatetimeIndex
I wrote a test script to better illustrate what I'm talking about; although it uses a different method for the conversion, the concept is the same:
import numpy as np
import pandas as pd
import pyarrow as pa
from pyarrow import feather
from guppy import hpy
import psutil
import os
import time

DATA_FILE = 'test.arrow'
process = psutil.Process(os.getpid())

def setup():
    np.random.seed(0)
    df = pd.DataFrame(np.random.randint(0, 100, size=(7196546, 57)),
                      columns=[f'{i}' for i in range(57)])
    mem_size = process.memory_info().rss / 1e9
    print(f'before feather {mem_size}gb: \n{hpy().heap()}')
    df.to_feather(DATA_FILE)
    time.sleep(5)
    mem_size = process.memory_info().rss / 1e9
    print(f'after writing to feather {mem_size}gb: \n{hpy().heap()}')
    print(f'wrote {DATA_FILE}')
    import sys
    sys.exit()

def foo():
    mem_size = process.memory_info().rss / 1e9
    path = DATA_FILE
    print(f'before reading table {mem_size}gb: \n{hpy().heap()}')
    feather_table = feather.read_table(path)
    mem_size = process.memory_info().rss / 1e9
    print(f'after reading table {mem_size}gb: \n{hpy().heap()}')
    df = feather_table.to_pandas()
    mem_size = process.memory_info().rss / 1e9
    print(f'after converting to pandas {mem_size}gb: \n{hpy().heap()}')
    return df

if __name__ == "__main__":
    # setup()
    df = foo()
    time.sleep(5)
    mem_size = process.memory_info().rss / 1e9
    print(f'final heap {mem_size}gb: \n{hpy().heap()}')
Note that setup() needs to be run once before calling foo().

Output (from the setup run):
before feather 3.374010368gb:
Partition of a set of 229931 objects. Total size = 3313572857 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 1 0 3281625136 99 3281625136 99 pandas.core.frame.DataFrame
1 59491 26 9902952 0 3291528088 99 str
2 64105 28 5450160 0 3296978248 99 tuple
3 30157 13 2339796 0 3299318044 100 bytes
4 15221 7 2203888 0 3301521932 100 types.CodeType
5 14449 6 2080656 0 3303602588 100 function
6 6674 3 2018224 0 3305620812 100 dict (no owner)
7 1860 1 1539768 0 3307160580 100 type
8 630 0 1158616 0 3308319196 100 dict of module
9 1860 1 1078064 0 3309397260 100 dict of type
<616 more rows. Type e.g. '_.more' to view.>
after writing to feather 3.40015104gb:
Partition of a set of 230564 objects. Total size = 6595283738 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 57 0 3281634096 50 3281634096 50 pandas.core.series.Series
1 1 0 3281625136 50 6563259232 100 pandas.core.frame.DataFrame
2 59548 26 9905849 0 6573165081 100 str
3 64073 28 5445176 0 6578610257 100 tuple
4 30153 13 2339608 0 6580949865 100 bytes
5 15219 7 2203600 0 6583153465 100 types.CodeType
6 6845 3 2064024 0 6585217489 100 dict (no owner)
7 14304 6 2059776 0 6587277265 100 function
8 1860 1 1540224 0 6588817489 100 type
9 630 0 1158616 0 6589976105 100 dict of module
<627 more rows. Type e.g. '_.more' to view.>
wrote test.arrow
Output (normal run, without setup):
before reading table 0.092004352gb:
Partition of a set of 229908 objects. Total size = 31941164 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 59491 26 9902952 31 9902952 31 str
1 64104 28 5450096 17 15353048 48 tuple
2 30157 13 2339788 7 17692836 55 bytes
3 15221 7 2203888 7 19896724 62 types.CodeType
4 14449 6 2080656 7 21977380 69 function
5 6669 3 2016984 6 23994364 75 dict (no owner)
6 1860 1 1539768 5 25534132 80 type
7 630 0 1158616 4 26692748 84 dict of module
8 1860 1 1078064 3 27770812 87 dict of type
9 1979 1 490792 2 28261604 88 dict of function
<605 more rows. Type e.g. '_.more' to view.>
after reading table 3.512406016gb:
Partition of a set of 229383 objects. Total size = 3313510008 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 1 0 3281625032 99 3281625032 99 pyarrow.lib.Table
1 59491 26 9902952 0 3291527984 99 str
2 63952 28 5436848 0 3296964832 100 tuple
3 30153 13 2339600 0 3299304432 100 bytes
4 15219 7 2203600 0 3301508032 100 types.CodeType
5 14303 6 2059632 0 3303567664 100 function
6 6669 3 2016984 0 3305584648 100 dict (no owner)
7 1860 1 1539768 0 3307124416 100 type
8 630 0 1158616 0 3308283032 100 dict of module
9 1860 1 1078064 0 3309361096 100 dict of type
<604 more rows. Type e.g. '_.more' to view.>
after converting to pandas 6.797561856gb:
Partition of a set of 229432 objects. Total size = 6595149289 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 1 0 3281625136 50 3281625136 50 pandas.core.frame.DataFrame
1 1 0 3281625032 50 6563250168 100 pyarrow.lib.Table
2 59491 26 9902952 0 6573153120 100 str
3 63965 28 5437856 0 6578590976 100 tuple
4 30153 13 2339600 0 6580930576 100 bytes
5 15219 7 2203600 0 6583134176 100 types.CodeType
6 14303 6 2059632 0 6585193808 100 function
7 6673 3 2020016 0 6587213824 100 dict (no owner)
8 1860 1 1540264 0 6588754088 100 type
9 630 0 1158616 0 6589912704 100 dict of module
<618 more rows. Type e.g. '_.more' to view.>
final heap 6.79968768gb:
Partition of a set of 230570 objects. Total size = 6595283554 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 57 0 3281634096 50 3281634096 50 pandas.core.series.Series
1 1 0 3281625136 50 6563259232 100 pandas.core.frame.DataFrame
2 59538 26 9905349 0 6573164581 100 str
3 64080 28 5445672 0 6578610253 100 tuple
4 30153 13 2339600 0 6580949853 100 bytes
5 15219 7 2203600 0 6583153453 100 types.CodeType
6 6844 3 2062552 0 6585216005 100 dict (no owner)
7 14304 6 2059776 0 6587275781 100 function
8 1860 1 1540264 0 6588816045 100 type
9 630 0 1159152 0 6589975197 100 dict of module
<627 more rows. Type e.g. '_.more' to view.>
The dataframe seems to have a copy of itself on the heap, represented as pd.Series objects. It doesn't have one when the dataframe is first created — the copy only appears once it has been written to the arrow/feather file. And once we read that file back, the Series show up again, the same size as the dataframe, as expected.
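(Not part of the original post — one way to check directly whether the pandas result still views the Arrow buffers, rather than inferring it from guppy, is to compare the underlying memory. This is a minimal sketch, assuming the test.arrow file written by setup() above, with int64 columns named '0'..'56' and no nulls:)

import numpy as np
from pyarrow import feather

table = feather.read_table('test.arrow')
df = table.to_pandas()

# to_numpy(zero_copy_only=True) raises if pyarrow would need to copy,
# so arrow_col is guaranteed to be a view over the Table's own buffer.
arrow_col = table.column('0').chunk(0).to_numpy(zero_copy_only=True)
pandas_col = df['0'].to_numpy()

# False here means to_pandas() materialized a second copy of the column.
print(np.shares_memory(arrow_col, pandas_col))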
Answer 1:

"Does conversion from arrow format to pandas dataframe duplicate data on the heap?"

The docs do a good job of explaining what is going on: https://arrow.apache.org/docs/python/pandas.html#memory-usage-and-zero-copy

In your case the data really is copied. In certain cases you can get away without copying the data.

What I can't make sense of is guppy's output. For example, in the final heap, after the arrow table has gone out of scope, it looks as if there are two copies of the data (one in the DataFrame and one spread across the 57 Series), when I would expect only about 3GB in total.
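(The docs section linked above describes two to_pandas() options for exactly this situation. A sketch of how that would look here — my addition, not part of the original answer — reusing the question's test.arrow file:)

from pyarrow import feather

table = feather.read_table('test.arrow')

# split_blocks=True produces one pandas block per column instead of
# consolidating everything into a few giant 2D blocks, which makes
# zero-copy possible for individual columns; self_destruct=True lets
# to_pandas() release each Arrow buffer as soon as its column has
# been converted.
df = table.to_pandas(split_blocks=True, self_destruct=True)

# Per the pyarrow docs, the Table must not be used again after a
# self_destruct conversion.
del table

This matches the experiment described in the comment below, where rerunning with these options brought RSS down to ~3GB even though guppy still reported ~6GB.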
Comments:

Yes, I think what confused me was the guppy output after the function returns — I assumed the original data would have been dropped by then. When I reran the test with split_blocks=True, self_destruct=True and deleted the table, RSS was about 3GB while guppy still reported about 6GB. I guess guppy may just not be that accurate; hard to say.

Answer 2:

pyarrow, pandas, and numpy all have different views of the same underlying memory. Guppy doesn't seem to recognize this (and I imagine doing so would be difficult), so it appears to double-count. Here is a simple example:
import numpy as np
import os
import psutil
import pyarrow as pa
from guppy import hpy
process = psutil.Process(os.getpid())
# Will consume ~800MB of RAM
x = np.random.rand(100000000)
print(hpy().heap())
# Partition of a set of 98412 objects. Total size = 813400879 bytes.
print(process.memory_info().rss)
# 855588864
# This is a zero-copy operation. Note
# that RSS remains consistent. Both x
# and arr reference the same underlying
# array of doubles.
arr = pa.array(x)
print(hpy().heap())
# Partition of a set of 211452 objects. Total size = 1629410271 bytes.
print(process.memory_info().rss)
# 891699200
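(Extending that example — my addition, not from the original answer — converting back out to numpy shows the same effect a third time:)

# Also zero-copy: to_numpy(zero_copy_only=True) raises if pyarrow
# would have to allocate a new buffer, so y is a third view over
# the same ~800MB of doubles.
y = arr.to_numpy(zero_copy_only=True)
print(np.shares_memory(x, y))
# True
print(process.memory_info().rss)
# Essentially unchanged from the 891699200 printed above, even though
# x, arr, and y each "contain" the full 800MB as far as a naive
# per-object accounting is concerned.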