Pass list to udf method as a parameter

Posted 2023-04-15


[Title]: Pass list to udf method as a parameter [Posted]: 2021-05-16 20:15:58 [Question]:

I am using the text preprocessing library https://github.com/berknology/text-preprocessing

I want to pass preprocess_functions as an argument to the preprocess_text method,

using the example below:

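from pyspark.sql import DataFrame as SparkDataFrame
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from text_preprocessing import (preprocess_text, to_lower, remove_email, remove_url,
    remove_punctuation, remove_special_character, normalize_unicode, remove_number,
    remove_whitespace, remove_stopword, lemmatize_word, stem_word, check_spelling)

def preprocess_text_spark(df: SparkDataFrame,
                          target_column: str,
                          preprocessed_column_name: str = 'preprocessed_text'
                         ) -> SparkDataFrame:
    """Preprocess text in a column of a PySpark DataFrame by leveraging a PySpark UDF to preprocess text in parallel."""
    preprocess_functions = [to_lower, remove_email, remove_url, remove_punctuation,
                            remove_special_character, normalize_unicode, remove_number,
                            remove_whitespace, remove_stopword, lemmatize_word,
                            stem_word, check_spelling]
    _preprocess_text = udf(preprocess_text, StringType())
    # This call raises the TypeError shown below: the second argument is a
    # Python list of functions, not a column or a column name.
    new_df = df.withColumn(preprocessed_column_name,
                           _preprocess_text(df[target_column], preprocess_functions))
    return new_df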

This is the error I get:

TypeError: Invalid argument, not a string or column: [<function to_lower at 0x7f33f9a865f0>, <function remove_email at 0x7f33f9a93c20>, <function remove_url at 0x7f33f9a933b0>, <function remove_punctuation at 0x7f33f9a934d0>, <function remove_special_character at 0x7f33f9a935f0>, <function normalize_unicode at 0x7f33f9a93a70>, <function remove_number at 0x7f33f9a93170>, <function remove_whitespace at 0x7f33f9a93830>, <function remove_stopword at 0x7f33f9a93b00>, <function lemmatize_word at 0x7f33f9a8d4d0>, <function stem_word at 0x7f33f9a8d3b0>, <function check_spelling at 0x7f33f9a8d170>] of type <class 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

I tried wrapping preprocess_functions in array and lit, but without success.

How can I fix this?


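[Answer 1]:

A Spark UDF cannot take functions as input when it is called; its arguments must be columns, or column names given as strings. See the examples here: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.udf.html?highlight=udf#pyspark.sql.functions.udf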
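One possible workaround, offered as a sketch and not as the library's documented approach, is to bind the function list on the Python side with a closure, so that the UDF itself receives only the text column. In the snippet below, `make_preprocessor` is a hypothetical helper, and `to_lower`, `remove_whitespace`, and `preprocess_text` are simplified stand-ins for the real text_preprocessing functions (assumed here to take the text first and the function list second), so the pure-Python part can be tried without Spark installed.

```python
from typing import Callable, List, Optional


# Hypothetical stand-ins for the text_preprocessing functions (assumption:
# each real function maps str -> str, as these do).
def to_lower(text: str) -> str:
    return text.lower()


def remove_whitespace(text: str) -> str:
    return " ".join(text.split())


def preprocess_text(text: str, functions: List[Callable[[str], str]]) -> str:
    # Apply each preprocessing function in order.
    for fn in functions:
        text = fn(text)
    return text


def make_preprocessor(functions: List[Callable[[str], str]]):
    """Hypothetical helper: capture the function list in a closure.

    The returned callable takes only the text, so it can be wrapped in a UDF
    whose sole argument is a column.
    """
    bound = list(functions)

    def _apply(text: Optional[str]) -> Optional[str]:
        if text is None:  # Spark passes None for null cells
            return None
        return preprocess_text(text, bound)

    return _apply


def preprocess_text_spark(df, target_column, functions,
                          preprocessed_column_name="preprocessed_text"):
    # pyspark is imported here so the helpers above work without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    _preprocess_text = udf(make_preprocessor(functions), StringType())
    # Only a column is passed to the UDF; the functions travel in the closure.
    return df.withColumn(preprocessed_column_name,
                         _preprocess_text(df[target_column]))
```

With this shape, the list is passed as an ordinary Python argument, for example `preprocess_text_spark(df, "text", [to_lower, remove_whitespace])`, and never reaches Spark as a column literal. It relies on the chosen functions being serializable so Spark can ship them to the workers.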
