Check whether the elements of a given list are present in an array column of a DataFrame

Posted 2023-04-17


【Title】: To check if elements in a given list present in array column in DataFrame 【Posted】: 2021-04-01 22:06:09 【Question】:

I have the following function, which works on a pandas DataFrame:

def event_list(df,steps):
    df['steps_present'] =  df['labels'].apply(lambda x:all(step in x for step in steps))
    return df

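The DataFrame has a column named labels whose values are lists. The function takes the DataFrame and steps (a list) and returns the DataFrame with a new column, steps_present, which is True when every element of steps is present in that row's labels list.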

value in df['labels'] =  [EBBY , ABBY , JULIE , ROBERTS]

event_list(df,['EBBY','ABBY']) returns True for this record, because both EBBY and ABBY are present in the list stored in the labels column.

I would like to create a similar function in PySpark.


【Answer 1】:

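You can use array_except to check whether every element of the supplied list is present in the labels column. array_except returns the elements of its first array that are missing from the second, so if every step is present the result has size 0. Comparing that size with 0 gives the boolean you want.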

import pyspark.sql.functions as F

def event_list(df, steps):
    return df.withColumn(
        'steps_present', 
        F.size(F.array_except(F.array(*[F.lit(l) for l in steps]), 'labels')) == 0
    )

df2 = event_list(df, ["EBBY", "ABBY"])

df2.show(truncate=False)
+----------------------------+-------------+
|labels                      |steps_present|
+----------------------------+-------------+
|[EBBY, ABBY, JULIE, ROBERTS]|true         |
|[EBBY, JULIE]               |false        |
+----------------------------+-------------+
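To make the reasoning behind the size check concrete, here is a plain-Python sketch of the same logic. The helper names array_except_model and steps_present are hypothetical and only imitate what array_except does on a single row (elements of the first list that are missing from the second, without duplicates). This is not Spark code.

```python
def array_except_model(left, right):
    """Plain-Python stand-in for array_except (hypothetical helper, not a Spark API):
    elements of `left` that do not appear in `right`, de-duplicated."""
    right_set = set(right)
    # dict.fromkeys drops duplicates while preserving the original order
    return [x for x in dict.fromkeys(left) if x not in right_set]


def steps_present(labels, steps):
    # Every step is present exactly when nothing is left over after the "except"
    return len(array_except_model(steps, labels)) == 0


rows = [["EBBY", "ABBY", "JULIE", "ROBERTS"], ["EBBY", "JULIE"]]
results = [steps_present(labels, ["EBBY", "ABBY"]) for labels in rows]
print(results)  # [True, False]
```

Note that the steps list goes first: swapping the arguments would instead report the labels that are not in steps.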


【Answer 2】:

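You can turn the function into a UDF, for example as follows. Declare a boolean return type, because a bare @udf returns strings by default.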

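from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, array, udf
from pyspark.sql.types import BooleanType

# Reuses the active session if one exists (e.g. in the pyspark shell)
spark = SparkSession.builder.getOrCreate()

values = [(["EBBY", "ABBY", "JULIE", "ROBERTS"],),
          (["EBBY", "ABBY"],)]
columns = ['labels']
df = spark.createDataFrame(values, columns)

# Without an explicit return type, @udf would produce a string column
@udf(returnType=BooleanType())
def event_list(column_to_test, input_values):
    # True only if every requested value appears in the row's array
    return all(value in column_to_test for value in input_values)

steps = ["EBBY", "JULIE"]
# Pass the Python list as a literal array column so the UDF receives it per row
df.withColumn("steps_present", event_list(df['labels'], array([lit(x) for x in steps]))).show(truncate=False)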

