Using PySpark, determine how many elements of an array column are contained in an array of arrays in another column

Posted 2023-04-13


[Title]: Utilizing PySpark, determine how many of the elements within an array column are contained within an array of arrays in another column [Posted]: 2020-08-17 16:54:21 [Question]:

I have a dataset like the following:

+--------------------+--------------------+
|                col1|                col2|
+--------------------+--------------------+
|[[563], [242, 178]] |          [563, 178]|
|[[563], [242, 178]] |     [563, 178, 242]|
|[[563], [242, 178]] |     [563, 242, 178]|
|[[563], [242, 178]] |     [242, 178, 563]|
+--------------------+--------------------+

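What I want to do is determine how many of the values in col2 are contained, in order, in col1. The order in col1 matters only at the top-level array; it does not matter within the lower-level arrays.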

For example, the output for the DataFrame above should be:

+--------------------+--------------------+------+
|                col1|                col2|Output|
+--------------------+--------------------+------+
|[[563], [242, 178]] |          [563, 178]|     2|
|[[563], [242, 178]] |     [563, 178, 242]|     3|
|[[563], [242, 178]] |     [563, 242, 178]|     3|
|[[563], [242, 178]] |     [242, 178, 563]|     2|
+--------------------+--------------------+------+

I'm fairly sure this needs a UDF, but I'm struggling with how to iterate over the sub-arrays in col1.

Any help would be greatly appreciated!

Spencer

[Question comments]:

[Answer 1]:

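From Spark 2.4 onward, you can flatten col1 with the flatten function, intersect the result with col2 using array_intersect, and then use size to get the number of elements in the resulting array.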

Example:

df.show()
#+-------------------+---------------+
#|               col1|           col2|
#+-------------------+---------------+
#|[[563], [242, 178]]|     [563, 178]|
#|[[563], [242, 178]]|[563, 178, 242]|
#+-------------------+---------------+

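from pyspark.sql.functions import array_intersect, col, flatten, size

# Flatten col1, intersect it with col2, and count the common elements.
df.withColumn("flattened", flatten(col("col1"))).\
withColumn("output", size(array_intersect(col("col2"), col("flattened")))).\
drop("flattened").\
show()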
#+-------------------+---------------+------+
#|               col1|           col2|output|
#+-------------------+---------------+------+
#|[[563], [242, 178]]|     [563, 178]|     2|
#|[[563], [242, 178]]|[563, 178, 242]|     3|
#+-------------------+---------------+------+

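[Comments]:

This finds the total number of matches between the two columns, but it doesn't address the ordering problem I outlined above.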
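For the ordering requirement, here is a minimal sketch under one possible reading of the question; it is not taken from the thread. The assumption is that each value in col2 is mapped to the index of the col1 sub-array that contains it, and the answer is the length of the longest run of col2 values (not necessarily adjacent) whose sub-array indices never decrease. That reading reproduces the four expected outputs in the question, but other readings might too. It also assumes each value appears in only one sub-array. The names `ordered_match_count` and `make_ordered_match_udf` are hypothetical.

```python
from bisect import bisect_right


def ordered_match_count(col1, col2):
    """Count col2 values that can be matched to col1's sub-arrays in order.

    Assumed semantics (not confirmed by the question's author): order
    matters between the sub-arrays of col1, but not inside a sub-array.
    """
    if col1 is None or col2 is None:
        return None

    # Map each value to the index of the first sub-array that holds it.
    group_of = {}
    for idx, group in enumerate(col1):
        for value in group:
            group_of.setdefault(value, idx)

    # Sub-array index for every col2 value that appears somewhere in col1.
    indices = [group_of[v] for v in col2 if v in group_of]

    # Length of the longest non-decreasing subsequence of those indices,
    # computed with the standard "tails" technique.
    tails = []
    for i in indices:
        pos = bisect_right(tails, i)
        if pos == len(tails):
            tails.append(i)
        else:
            tails[pos] = i
    return len(tails)


def make_ordered_match_udf():
    """Wrap the function as a Spark UDF (pyspark is imported lazily)."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    return udf(ordered_match_count, IntegerType())
```

With a DataFrame like the one in the question, this could be applied as `df.withColumn("Output", make_ordered_match_udf()(col("col1"), col("col2")))`. A Python UDF is slower than built-in functions, so it is worth checking the semantics on a small sample first.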
