Pass list to udf method as a parameter
【Posted】: 2021-05-16 20:15:58
【Question】: I am using the text preprocessing library https://github.com/berknology/text-preprocessing
I want to pass preprocess_functions as an argument to the preprocess_text method, as in the example below:
def preprocess_text_spark(df: SparkDataFrame,
                          target_column: str,
                          preprocessed_column_name: str = 'preprocessed_text'
                          ) -> SparkDataFrame:
    """ Preprocess text in a column of a PySpark DataFrame by leveraging PySpark UDF to preprocess text in parallel """
    preprocess_functions = [to_lower, remove_email, remove_url, remove_punctuation,
                            remove_special_character, normalize_unicode, remove_number,
                            remove_whitespace, remove_stopword, lemmatize_word,
                            stem_word, check_spelling]
    _preprocess_text = udf(preprocess_text, StringType())
    new_df = df.withColumn(preprocessed_column_name, _preprocess_text(df[target_column], preprocess_functions))
    return new_df
This is the error I get:
TypeError: Invalid argument, not a string or column: [<function to_lower at 0x7f33f9a865f0>, <function remove_email at 0x7f33f9a93c20>, <function remove_url at 0x7f33f9a933b0>, <function remove_punctuation at 0x7f33f9a934d0>, <function remove_special_character at 0x7f33f9a935f0>, <function normalize_unicode at 0x7f33f9a93a70>, <function remove_number at 0x7f33f9a93170>, <function remove_whitespace at 0x7f33f9a93830>, <function remove_stopword at 0x7f33f9a93b00>, <function lemmatize_word at 0x7f33f9a8d4d0>, <function stem_word at 0x7f33f9a8d3b0>, <function check_spelling at 0x7f33f9a8d170>] of type <class 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
I tried wrapping preprocess_functions with array and lit, but that did not work either.
How can I solve this?
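【Answer 1】: A Spark udf cannot take functions as input when it is called. It only accepts columns, or strings that name columns. See the example here: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.udf.html?highlight=udf#pyspark.sql.functions.udf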
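One common way around this restriction, which the answer above does not spell out, is to capture the function list in a closure before wrapping it with udf, so that only the column is passed when the UDF is called. The following is a minimal sketch under assumptions: to_lower_demo, remove_whitespace_demo and preprocess_text_demo are hypothetical stand-ins for the library's functions (assumed here to take the text first and the function list second, as in the question's call), and bind_steps is a helper name invented for illustration.

```python
from typing import Callable, List, Optional

TextStep = Callable[[str], str]


def to_lower_demo(text: str) -> str:
    """Hypothetical stand-in for the library's to_lower."""
    return text.lower()


def remove_whitespace_demo(text: str) -> str:
    """Hypothetical stand-in for the library's remove_whitespace."""
    return " ".join(text.split())


def preprocess_text_demo(text: str, steps: List[TextStep]) -> str:
    """Hypothetical stand-in for preprocess_text: apply each step in order."""
    for step in steps:
        text = step(text)
    return text


def bind_steps(steps: List[TextStep]) -> Callable[[Optional[str]], Optional[str]]:
    """Return a one-argument function with the step list captured in a closure."""
    def apply_steps(text: Optional[str]) -> Optional[str]:
        # Spark passes None for null cells, so leave those untouched.
        if text is None:
            return None
        return preprocess_text_demo(text, steps)
    return apply_steps


def preprocess_text_spark(df, target_column, steps,
                          preprocessed_column_name="preprocessed_text"):
    # Imported here so the pure-Python helpers above can be tried without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # Only the column is passed at call time; the functions travel in the closure.
    _preprocess_text = udf(bind_steps(steps), StringType())
    return df.withColumn(preprocessed_column_name, _preprocess_text(df[target_column]))
```

With the real library, the stand-ins would presumably be replaced by preprocess_text and the functions listed in the question, and the call would look like preprocess_text_spark(df, 'text', preprocess_functions). The library must then also be installed on the Spark executors, since the captured functions run there.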