PySpark 数据帧操作导致 OutOfMemoryError

Posted

技术标签:

【中文标题】PySpark 数据帧操作导致 OutOfMemoryError【英文标题】:PySpark dataframe operation causes OutOfMemoryError 【发布时间】:2020-01-07 07:23:20 【问题描述】:

我刚刚开始尝试使用 pyspark/spark 并遇到了我的代码无法正常工作的问题。我找不到问题,火花的错误输出不是很有帮助。我确实在 *** 上找到了类似的问题,但没有一个有明确的答案或解决方案(至少对我来说不是)。

我要运行的代码是:

import json
from datetime import datetime, timedelta

from pyspark.sql.session import SparkSession

from parse.data_reader import read_csv
from parse.interpolate import insert_time_range, create_time_range, linear_interpolate

spark = SparkSession.builder.getOrCreate()

df = None
with open('config/data_sources.json') as sources_file:
    sources = json.load(sources_file)
    for file in sources['files']:
        with open('config/mappings/.json'.format(file['mapping'])) as mapping:
            df_to_append = read_csv(
                spark=spark,
                file=''.format(sources['root_path'], file['name']),
                config=json.load(mapping)
            )

        if df is None:
            df = df_to_append
        else:
            df = df.union(df_to_append)

df.sort(["Timestamp", "Variable"]).show(n=5, truncate=False)

time_range = create_time_range(
    datetime(year=2019, month=7, day=1, hour=0),
    datetime(year=2019, month=7, day=8, hour=0),
    timedelta(seconds=3600)
)

df_with_intervals = insert_time_range(
    df=df,
    timestamp_column_name='Timestamp',
    variable_column_name='Variable',
    value_column_name='Value',
    time_range=time_range,
)
df_with_intervals.sort(["Timestamp", "Variable"]).show(n=5, truncate=False)

它给出以下输出:

C:\Users\mmun01\PycharmProjects\xxxx\venv\Scripts\python.exe C:/Users/mmun01/PycharmProjects/xxxx/application.py
19/09/04 13:31:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/09/04 13:31:36 WARN MetricsSystem: Using default name SparkStatusTracker for source because neither spark.metrics.namespace nor spark.app.id is set.
[Stage 4:=======================>                                   (2 + 3) / 5]19/09/04 13:31:52 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
View job details at https://xxxxxx.azuredatabricks.net/?o=xxxxxx#/setting/clusters/xxxxxx/sparkUi
[Stage 5:===========>                                               (1 + 4) / 5]+-----------------------+------------+-----+
|Timestamp              |Variable    |Value|
+-----------------------+------------+-----+
|2019-07-01 00:00:06.664|Load % PS DG|0.0  |
|2019-07-01 00:00:06.664|Load % SB DG|0.0  |
|2019-07-01 00:00:06.664|Power PS DG |null |
|2019-07-01 00:00:06.664|Power SB DG |null |
|2019-07-01 00:00:06.664|Power Shore |null |
+-----------------------+------------+-----+
only showing top 5 rows

Traceback (most recent call last):
  File "C:/Users/mmun01/PycharmProjects/xxxx/application.py", line 42, in <module>
    df_with_intervals.sort(["Timestamp", "Variable"]).show(n=5, truncate=False)
  File "C:\Users\mmun01\PycharmProjects\xxxx\venv\lib\site-packages\pyspark\sql\dataframe.py", line 381, in show
    print(self._jdf.showString(n, int(truncate), vertical))
  File "C:\Users\mmun01\PycharmProjects\xxxx\venv\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\Users\mmun01\PycharmProjects\xxxx\venv\lib\site-packages\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "C:\Users\mmun01\PycharmProjects\xxxx\venv\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o655.showString.
: java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Unknown Source)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(Unknown Source)
    at java.lang.AbstractStringBuilder.append(Unknown Source)
    at java.lang.StringBuilder.append(Unknown Source)
    at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:210)
    at com.trueaccord.scalapb.textformat.TextGenerator.maybeNewLine(TextGenerator.scala:13)
    at com.trueaccord.scalapb.textformat.TextGenerator.addNewLine(TextGenerator.scala:33)
    at com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:38)
    at com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
    at com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
    at com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
    at com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
    at com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
    at com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
    at com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
    at com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
    at com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
    at com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
    at com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
    at com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
    at com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
    at com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
    at com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)


Process finished with exit code 1

我使用的两个函数是:

def create_time_range(start_time: datetime, end_time: datetime, step_size: timedelta) -> Iterable[datetime]:
    return [start_time + step_size * n for n in range(int((end_time - start_time) / step_size))]


def insert_time_range(df: DataFrame, timestamp_column_name: str, variable_column_name: str, value_column_name: str,
                      time_range: Iterable[datetime]) -> DataFrame:
    time_range = array([lit(ts) for ts in time_range])
    df_exploded = df \
        .drop(value_column_name) \
        .drop(timestamp_column_name) \
        .distinct() \
        .withColumn(value_column_name, lit(None)) \
        .withColumn(timestamp_column_name, explode(time_range))
    return df.union(df_exploded.select([timestamp_column_name, variable_column_name, value_column_name]))

data_sources.json 文件目前仅包含一个 csv 文件,大小为几 MB。是什么导致 OutOfMemoryException 或如何获得更详细的错误报告?

根据niuer 的建议,我将函数insert_time_range 更改为:

def insert_time_range(df: DataFrame, timestamp_column_name: str, variable_column_name: str, value_column_name: str,
                      time_range: Iterable[datetime]) -> DataFrame:
    time_range = array([lit(ts) for ts in time_range])
    df_exploded = df \
        .drop(value_column_name) \
        .drop(timestamp_column_name) \
        .distinct() \
        .withColumn(value_column_name, lit(None)) \
        .withColumn(timestamp_column_name, lit(time_range[0]))
    return df_exploded.select([timestamp_column_name, variable_column_name, value_column_name])

.show() 调用之前,我添加了一行print(df_with_intervals.count()),它输出数字5(如预期的那样)。但是当我尝试show() 时,我得到的值仍然相同OutOfMemoryException

更新 我已将问题范围缩小到工会,但仍不清楚为什么它不起作用。我已经根据 cmets 中的建议更新了 insert_time_range 方法:

def insert_time_range(df: DataFrame, timestamp_column_name: str, variable_column_name: str, value_column_name: str,
                      time_range: Iterable[datetime]) -> DataFrame:
    schema = StructType(
        [
            StructField(timestamp_column_name, TimestampType(), True),
            StructField(value_column_name, DoubleType(), True)
        ]
    )
    df_time_range = df.sql_ctx.createDataFrame(
        [(timestamp, None) for timestamp in time_range],
        schema=schema
    )
    df_time_range = df.select([variable_column_name]).distinct().crossJoin(df_time_range).select(
        [timestamp_column_name, variable_column_name, value_column_name]
    )
    df_time_range.show(n=20, truncate=False)

    return df.union(df_time_range)

给出以下输出:

C:\Users\mmun01\PycharmProjects\xxxx\venv\Scripts\python.exe C:/Users/mmun01/PycharmProjects/xxxx/application.py
19/09/09 23:00:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/09/09 23:00:30 WARN MetricsSystem: Using default name SparkStatusTracker for source because neither spark.metrics.namespace nor spark.app.id is set.
[Stage 44:==================================>                       (3 + 2) / 5]19/09/09 23:00:43 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
View job details at https://westeurope.azuredatabricks.net/?o=2202252276771286#/setting/clusters/0903-124716-art213/sparkUi
[Stage 45:===========>                                              (1 + 4) / 5]+-----------------------+------------+-----+
|Timestamp              |Variable    |Value|
+-----------------------+------------+-----+
|2019-07-01 00:00:06.664|Load % PS DG|0.0  |
|2019-07-01 00:00:06.664|Load % SB DG|0.0  |
|2019-07-01 00:00:06.664|Power PS DG |null |
|2019-07-01 00:00:06.664|Power SB DG |null |
|2019-07-01 00:00:06.664|Power Shore |null |
+-----------------------+------------+-----+
only showing top 5 rows

View job details at https://westeurope.azuredatabricks.net/?o=2202252276771286#/setting/clusters/0903-124716-art213/sparkUi
+-------------------+------------+-----+
|Timestamp          |Variable    |Value|
+-------------------+------------+-----+
|2019-06-30 22:00:00|Load % PS DG|null |
|2019-06-30 22:00:00|Power PS DG |null |
|2019-06-30 22:00:00|Power Shore |null |
|2019-06-30 22:00:00|Load % SB DG|null |
|2019-06-30 22:00:00|Power SB DG |null |
|2019-06-30 22:01:00|Load % PS DG|null |
|2019-06-30 22:01:00|Power PS DG |null |
|2019-06-30 22:01:00|Power Shore |null |
|2019-06-30 22:01:00|Load % SB DG|null |
|2019-06-30 22:01:00|Power SB DG |null |
|2019-06-30 22:02:00|Load % PS DG|null |
|2019-06-30 22:02:00|Power PS DG |null |
|2019-06-30 22:02:00|Power Shore |null |
|2019-06-30 22:02:00|Load % SB DG|null |
|2019-06-30 22:02:00|Power SB DG |null |
|2019-06-30 22:03:00|Load % PS DG|null |
|2019-06-30 22:03:00|Power PS DG |null |
|2019-06-30 22:03:00|Power Shore |null |
|2019-06-30 22:03:00|Load % SB DG|null |
|2019-06-30 22:03:00|Power SB DG |null |
+-------------------+------------+-----+
only showing top 20 rows

Traceback (most recent call last):
  File "C:/Users/mmun01/PycharmProjects/xxxx/application.py", line 46, in <module>
    df_with_intervals.sort([timestamp_column_name, variable_column_name]).show(n=5, truncate=False)
  File "C:\Users\mmun01\PycharmProjects\xxxx\venv\lib\site-packages\pyspark\sql\dataframe.py", line 381, in show
    print(self._jdf.showString(n, int(truncate), vertical))
  File "C:\Users\mmun01\PycharmProjects\xxxx\venv\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\Users\mmun01\PycharmProjects\xxxx\venv\lib\site-packages\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "C:\Users\mmun01\PycharmProjects\xxxx\venv\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o333.showString.
: java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Unknown Source)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(Unknown Source)
    at java.lang.AbstractStringBuilder.append(Unknown Source)
    at java.lang.StringBuilder.append(Unknown Source)
    at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:210)
    at com.trueaccord.scalapb.textformat.TextGenerator.maybeNewLine(TextGenerator.scala:13)
    at com.trueaccord.scalapb.textformat.TextGenerator.add(TextGenerator.scala:19)
    at com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:33)
    at com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
    at com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
    at com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
    at com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
    at com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
    at com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
    at com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
    at com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
    at com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
    at com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
    at com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
    at com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
    at com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
    at com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
    at com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)


Process finished with exit code 1

所以问题一定出在union 方法中,但我不知道问题是什么?

更新在我的第一次尝试中,config/data_sources.json 中只有一个 CSV 文件,因此从未执行过 df = df.union(df_to_append) 行。现在我在config/data_sources.json 中添加了多个CSV 文件,然后执行union 方法,我再次收到py4j.protocol.Py4JJavaError: An error occurred while calling o2043.showString. : java.lang.OutOfMemoryError: Java heap space 错误,但它已经发生在第一个union 中。这个方法我做错了什么或者这个方法本身有错误?

【问题讨论】:

尝试使用time_range 创建一个单列数据框,然后执行crossJoin。 Pyspark 可能会自动为您重新分区和优化“爆炸”。 感谢您的评论,结果见我上面的更新。 为什么需要在crossJoin 后面加上union?你可以直接返回df_time_range。您可能还想尝试将df 写入中间位置。 crossJoin 的原因是df_time_range 不包含原始样本。之后的步骤是插入 df_time_range时间戳的值。 我说的是最后的union 步骤。那没有必要。你只需要返回最后一个df_time_range 【参考方案1】:

它可能来自您正在做的explode。您基本上对从 json 文件生成的所有行与time_range 中的日期时间进行交叉连接,它有 168 个元素。 我会先用F.lit() 替换explode,看看它是否运行。如果还有问题,我会删除union代码试试。

【讨论】:

insert_time_range 函数的目标是以固定的时间间隔插入行,下一步是插入值。插值完成后,非固定时间间隔上的值将被丢弃。 从源文件中读取的df有346265行。每行包含一个时间戳(~13 个字节)、一个变量(~50 个字节)和一个值(~10 个字节)。所以一行应该是 100 字节的顺序。总大小约为 33MB。然后,当我尝试在每个小时(一个月内)为每个变量(总共 5 个变量)添加一行时,我添加了 3720 行,这在我看来应该不会产生太大的影响。 还可以按照您的建议查看对原始问题的补充。唉,没有结果。 您的 Spark 集群设置是什么?尝试增加驱动程序/工作人员的内存。 我正在使用 Azure Databricks,驱动程序和工作人员都是 Standard_DS5_v2 56.0 GB Memory, 16 Cores, 3 DBU。这应该绰绰有余,数据集真的没有那么大。我可以在本地不太花哨的笔记本电脑上使用 pandas 轻松处理它。 '【参考方案2】:

我查看了你与牛儿的沟通。

您确定,您正在使用seconds=3600。我问这个是因为错误之前的 DF 在您的更新中指示 1 minute interval。使用seconds=60total range = 1 month,每个原始行将有44640 新行。这是相当大的数据爆炸。

另外,在distinct 之后添加repartition

df_exploded = df \
        .drop(value_column_name) \
        .drop(timestamp_column_name) \
        .distinct() \
        .repartition(2000) \ # some sane value please
        .withColumn(value_column_name, lit(None)) \
        .withColumn(timestamp_column_name, lit(time_range[0]))

【讨论】:

确实,我尝试了几个间隔选项,以确保数据不会变得太大。源 CSV 文件中只有 5 个变量,因此数据不可能那么大。 尝试在 distinct 后添加重新分区。【参考方案3】:

在我看来,通过在此循环中读取,您正在将所有内容读入单台机器(很可能是运行驱动程序的主机)的内存中(如果您没有在 NFS 中读取,也可能出现延迟问题)。您应该尝试以下方法:

sparkcontext.wholeTextFiles("path/*")

【讨论】:

我不确定我是否明白你的意思? CSV 文件可以有不同的方案,所以我必须独立阅读它们。 是否需要排序?根据数据的大小和可用内存,排序可能会很昂贵。

以上是关于PySpark 数据帧操作导致 OutOfMemoryError的主要内容,如果未能解决你的问题,请参考以下文章

pyspark 数据帧上的向量操作

将 numpy 数组的 rdd 转换为 pyspark 数据帧

如何根据来自其他 pyspark 数据帧的日期值过滤第二个 pyspark 数据帧?

两个数据帧的数组列的平均值并在pyspark中找到最大索引

无法将日志功能应用于 pyspark 数据帧

在 pyspark 中,我想将值的数据帧列传递给函数并在该数据列中操作说,第 5 个值