Can a UDF in pyspark return an object other than a column?

Posted 2023-04-13

【Title】Can a UDF from pyspark return an object different from a column? 【Posted】2018-12-18 17:29:23 【Question】:

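I want to apply some functions to the columns of a pyspark dataframe. I managed to do this with a UDF, but I would like the result to be some other object instead of a column of the dataframe: a pandas dataframe, a Python list, and so on.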

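I am using a classifier to divide each column into classes, but I want the result to be a summary of the classes rather than a modified pyspark dataframe. I don't know whether this is possible with a UDF.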

My code looks like this:

import numpy as np
import pandas as pd
import pyspark 
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, StringType, FloatType, DoubleType
sc = pyspark.SparkContext()
sqlCtx = SQLContext(sc)

df_pd = pd.DataFrame(
    data={'Income': [12.0, 45.0, 24.0, 24.0, 54.0],
          'Debt': [23.0, 4.0, 1.0, 6.0, 3.0]})
df = sqlCtx.createDataFrame(df_pd)


# function
def clase(x):
    #n = np.mean(df_pd[name])
    #n = np.mean(df_pd["Ingresos"])
    n = 30
    m = 20
    if x>=n:
        x="good"
    elif x>=m:
        x="regular"
    else:
        x="bad"
    return x

# UDF
clase_udf = udf(lambda z: clase(z), StringType())

(
    df.select('Income',
              'Debt',
              clase_udf('Income').alias('new') )
    .show()
)

This gives the following table:

+------+----+-------+
|Income|Debt|    new|
+------+----+-------+
|  12.0|23.0|    bad|
|  45.0| 4.0|   good|
|  24.0| 1.0|regular|
|  24.0| 6.0|regular|
|  54.0| 3.0|   good|
+------+----+-------+

What I want is to get something like this:

+-------+------------+
| Clases| Description|
+-------+------------+
|   good|   30<Income|
|regular|20<Income<30|
|    bad|   Income<20|
+-------+------------+

That is, something like a summary of the classes.

【Comments】:

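You don't need a udf to get the new column. However, it's not clear to me what you are asking. Do you want to derive Description from the data? But you have already specified the cutoffs for good, bad, regular... How is the desired output related to your input? If you already know m and n, why not just use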

spark.createDataFrame([('good', '30<Income'), ('regular', '20<Income<30'), ('bad', 'Income<20')], ["Clases", "Description"])

?

【Answer 1】:

You need to use a udf that returns a StringType:

I moved your constants out of the function, in case you want them to be global and to change them for several functions at once.

n = 30
m = 20

def description(x):
    if x >= n:
        x = str(n) + " < Income"
    elif x >= m:
        x = str(m) + " < Income < " + str(n)
    else:
        x = "Income < " + str(m)
    return x

description_udf = udf(lambda z: description(z), StringType())

df.select(
    clase_udf('Income').alias('Clases'),
    description_udf("Income").alias("Description")
).distinct().show()

The output is:

【Comments】:

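Yes, sorry for the mistake. But what I want is just a table like the last one, not the whole dataframe with the descriptions attached.

Just change your select so that it picks only the columns you want. I edited the post.

Ah, OK, you want something that is completely unrelated to your data. OK, I edited again. Just use "distinct".

I would like to know whether there is a better way to do this, because as far as I understand the UDF builds the whole column and only then are the distinct elements taken.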
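On the last point, here is a minimal sketch of one way to avoid computing the whole column, assuming the thresholds n and m are known up front as they are in the question. The helper names (summary_rows, to_spark_dataframe, classify_without_udf) are made up for illustration and appear nowhere in the thread. The summary is built from the thresholds alone, so no row of the data is touched, and the labelling is done with the built-in when/otherwise instead of a UDF. The descriptions use "<=" because clase() compares with ">=".

```python
def summary_rows(n=30, m=20, column="Income"):
    """Build (class, description) pairs from the thresholds only.

    Hypothetical helper: it never looks at the data, so nothing is
    computed per row and nothing has to be de-duplicated afterwards.
    """
    if m >= n:
        raise ValueError("expected m < n")
    return [
        ("good", f"{n} <= {column}"),
        ("regular", f"{m} <= {column} < {n}"),
        ("bad", f"{column} < {m}"),
    ]


def to_spark_dataframe(spark, rows):
    """Turn the pairs into a small Spark DataFrame.

    `spark` is assumed to be an existing SparkSession; the SQLContext
    from the question exposes createDataFrame as well.
    """
    return spark.createDataFrame(rows, ["Clases", "Description"])


def classify_without_udf(df, n=30, m=20, column="Income"):
    """Add the class label with when/otherwise, i.e. without a Python UDF."""
    # Imported lazily so the pure-Python helper above works without Spark.
    from pyspark.sql import functions as F

    value = F.col(column)
    label = (
        F.when(value >= n, "good")
        .when(value >= m, "regular")
        .otherwise("bad")
    )
    return df.withColumn("new", label)


if __name__ == "__main__":
    for clase, description in summary_rows():
        print(clase, "|", description)
```

As for the question in the title: a UDF only maps column values to column values. If a plain Python object is what you need at the end, df.collect() returns a list of Row objects and df.toPandas() returns a pandas dataframe, and both can be called on the result of any select.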
