Why is there a problem with the code? I am connected to the clusters

Posted 2023-04-14

Originally posted: 2019-09-23 04:56:46

Question:

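I am trying to apply a UDF to round these pct values. There may be a better way, and I am open to it, since I am new to pyspark. When I remove the udf and skip the rounding, everything works, so I am confident the DataFrame itself is fine.

So please help me, geniuses. Love and peace.

I built this DataFrame with spark.sql in Databricks, and it looks fine.

The code is as follows: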

from pyspark.sql.types import IntegerType

round_func = udf(lambda x:round(x,2), IntegerType())

q2_res = q2_res.withColumn('pct_DISREGARD', round_func(col('pct')))

display(q2_res)

Error: AttributeError: 'NoneType' object has no attribute '_jvm'


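Answer 1:

Apparently, functions from pyspark.sql.functions cannot be used inside a udf; a detailed explanation is given in this thread. You are calling the round function inside the udf, and it fails there because it only operates on columns. The same result can be achieved more simply: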

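from pyspark.sql.types import IntegerType
import pyspark.sql.functions as f

# Native column function: rounds 'pct' to 2 decimals without a UDF.
# Note: the cast to IntegerType then truncates the decimals; drop .astype(...) to keep them.
q2_res = q2_res.withColumn('pct_DISREGARD', f.round('pct', 2).astype(IntegerType()))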

It is generally recommended to avoid UDFs wherever possible, because they are usually slower than native DataFrame operations.
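If a UDF is needed anyway, one possible workaround is sketched below. It assumes the error arises because the name round in the notebook refers to pyspark.sql.functions.round (for example after a wildcard import), which the question does not confirm. The helper names round_pct and make_round_udf are hypothetical, introduced only for this illustration. The sketch calls Python's built-in round explicitly through the builtins module and declares DoubleType as the return type, since the function returns a float rather than an integer.

```python
import builtins


def round_pct(value, ndigits=2):
    """Round with Python's built-in round, unaffected by a shadowed `round` name."""
    if value is None:  # Spark passes SQL NULL to a Python UDF as None
        return None
    return builtins.round(float(value), ndigits)


def make_round_udf():
    """Wrap round_pct as a Spark UDF; pyspark is imported only when this is called."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    # DoubleType rather than IntegerType, because round_pct returns a float.
    return udf(round_pct, DoubleType())


# Hypothetical usage on the q2_res DataFrame from the question:
# from pyspark.sql.functions import col
# q2_res = q2_res.withColumn('pct_DISREGARD', make_round_udf()(col('pct')))
```

Keeping the rounding logic in a plain function means it can be checked locally without a cluster, while the native f.round approach above remains the faster choice for this task.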
