What is the best PySpark practice to subtract two string columns within a single Spark dataframe?

Posted 2023-03-31

【Posted】: 2021-10-12 11:57:56 【Question】:

Suppose I have a Spark dataframe like the following:

data	A	Expected_column = data - A
https://example1.org/path/to/file?param=42#fragment	param=42#fragment	https://example1.org/path/to/file?
https://example2.org/path/to/file	NaN	https://example2.org/path/to/file

I was wondering whether there is a suitable mechanism for subtracting one string column from another, for example:

sdf1 = sdf.withColumn('Expected_column', ( sdf['data'] - sdf['A'] ))

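This returns Null for every row of Expected_column. I checked other solutions such as question1, but they work across two dataframes, whereas my case is within a single dataframe, and they do not deal with string columns. The closest question I found is about date differences, which is not my case either.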

【Answer 1】:

The function you are looking for is called replace:

from pyspark.sql import functions as F

sdf.withColumn("data - A", F.expr("replace(data, coalesce(A, ''), '')")).show(
    truncate=False
)
+---------------------------------------------------+-----------------+----------------------------------+
|data                                               |A                |data - A                          |
+---------------------------------------------------+-----------------+----------------------------------+
|https://example1.org/path/to/file?param=42#fragment|param=42#fragment|https://example1.org/path/to/file?|
|https://example2.org/path/to/file                  |null             |https://example2.org/path/to/file |
+---------------------------------------------------+-----------------+----------------------------------+
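
The coalesce call is there because, as far as I know, Spark SQL string functions return null when any argument is null, so without it the second row would come back as null instead of the unchanged URL. As a rough plain-Python model of what the expression does for one row (the function name subtract_literal is hypothetical, and None stands in for SQL null), something like this can help when reasoning about edge cases:

```python
from typing import Optional


def subtract_literal(data: Optional[str], a: Optional[str]) -> Optional[str]:
    """Plain-Python model of replace(data, coalesce(A, ''), '') for one row."""
    if data is None:
        return None  # a null input string stays null
    search = a if a is not None else ""  # mirrors coalesce(A, '')
    if search == "":
        return data  # nothing to remove
    return data.replace(search, "")  # removes every literal occurrence


# The two rows from the question; None plays the role of the null in column A.
rows = [
    ("https://example1.org/path/to/file?param=42#fragment", "param=42#fragment"),
    ("https://example2.org/path/to/file", None),
]
results = [subtract_literal(d, a) for d, a in rows]
```

Note that this kind of replacement removes every occurrence of A inside data, not only a trailing one, so it is not a strict suffix removal.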

【Comments】:

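I changed regexp_replace to a simple replace. It should work.

It now works for all cases. So what is the reason for the difference in how replace and regexp_replace perform the replacement?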
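
On the last comment: regexp_replace interprets its pattern argument as a regular expression, while replace treats it as a literal string, so values containing characters such as ? or . can match differently. The sketch below illustrates the same distinction with Python's re module rather than Spark (Spark uses Java regular expressions, so details differ), and the sample strings are made up for illustration:

```python
import re

data = "https://example1.org/file?x=1"
pattern = "file?x=1"  # hypothetical value that contains a regex metacharacter

# Literal replacement: the substring is found and removed as-is.
literal_result = data.replace(pattern, "")

# Regex replacement: '?' makes the preceding 'e' optional, so the pattern
# only matches "filex=1" or "filx=1", neither of which occurs in data.
regex_result = re.sub(pattern, "", data)

# Escaping the pattern restores the literal behavior.
escaped_result = re.sub(re.escape(pattern), "", data)
```

Here literal_result and escaped_result both drop the substring, while regex_result leaves data unchanged because the unescaped pattern never matches.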
