Pass a column as an argument inside UDF Body

Posted 2023-04-13


[Title]: Pass a column as an argument inside UDF Body
[Published]: 2019-11-15 17:14:14
[Question]:

I have the following DataFrame:

+---------+----------+---------------------+------------+-----------------+
|CAR_OWNER|MOTOR_TYPE|ELECTRIFICATION_RATIO|ENERGY_IN_US|ENERGY_OUTSIDE_US|
+---------+----------+---------------------+------------+-----------------+
|     Alex|Electrical|                  1.0|          15|                0|
|      Bob| Thermical|                  0.0|           0|                5|
|   Claire|    Hybrid|                  0.5|           0|               10|
+---------+----------+---------------------+------------+-----------------+

along with the following function:

def associateEnergy(motorType: String, consumedEnergy: Float, consumptionType: String, elecRatio: Float): Float =
  motorType match {
    case "Electrical" => if (consumptionType == "ELEC") consumedEnergy else 0f
    case "Thermical"  => if (consumptionType == "THERM") consumedEnergy else 0f
    case "Hybrid"     => if (consumptionType == "ELEC") consumedEnergy * elecRatio else consumedEnergy * (1 - elecRatio)
  }
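As a quick check of what the function is expected to return, it can be evaluated on the sample rows without Spark. This is only a minimal sketch: the braces around the `match` are added so that it compiles under Scala 2, and the `val` names are made up for illustration.

```scala
// Same logic as the function above, repeated so the snippet is self-contained.
def associateEnergy(motorType: String, consumedEnergy: Float, consumptionType: String, elecRatio: Float): Float =
  motorType match {
    case "Electrical" => if (consumptionType == "ELEC") consumedEnergy else 0f
    case "Thermical"  => if (consumptionType == "THERM") consumedEnergy else 0f
    case "Hybrid"     => if (consumptionType == "ELEC") consumedEnergy * elecRatio else consumedEnergy * (1 - elecRatio)
  }

// Claire: Hybrid, ratio 0.5, 10 units consumed outside the US.
val claireElec  = associateEnergy("Hybrid", 10f, "ELEC", 0.5f)      // 5.0
val claireTherm = associateEnergy("Hybrid", 10f, "THERM", 0.5f)     // 5.0
// Alex: Electrical, 15 units consumed in the US.
val alexElec    = associateEnergy("Electrical", 15f, "ELEC", 1.0f)  // 15.0
// Bob: Thermical, so nothing is attributed to the electrical share.
val bobElec     = associateEnergy("Thermical", 5f, "ELEC", 0.0f)    // 0.0
```

Any other `motorType` value would raise a `scala.MatchError`, since the match has no default case.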

I want to compute the following fields:

ELEC_ENERGY_IN_US, ELEC_ENERGY_OUTSIDE_US, THERM_ENERGY_IN_US, THERM_ENERGY_OUTSIDE_US

I can do this with the following UDF:

def associateEnergyUdf(consumptionType: String) = udf(
    (motorType: String, consumedEnergy: Float, elecRatio: Float) =>
      associateEnergy(motorType, consumedEnergy, consumptionType, elecRatio)
  )

and this code:

inputDf
    .withColumn("ELEC_ENERGY_IN_US", associateEnergyUdf("ELEC")(col("MOTOR_TYPE"), col("ENERGY_IN_US"), col("ELECTRIFICATION_RATIO")))
    .withColumn("ELEC_ENERGY_OUTSIDE_US", associateEnergyUdf("ELEC")(col("MOTOR_TYPE"), col("ENERGY_OUTSIDE_US"), col("ELECTRIFICATION_RATIO")))
    .withColumn("THERM_ENERGY_IN_US", associateEnergyUdf("THERM")(col("MOTOR_TYPE"), col("ENERGY_IN_US"), col("ELECTRIFICATION_RATIO")))
    .withColumn("THERM_ENERGY_OUTSIDE_US", associateEnergyUdf("THERM")(col("MOTOR_TYPE"), col("ENERGY_OUTSIDE_US"), col("ELECTRIFICATION_RATIO")))

But I don't want to repeat the col("MOTOR_TYPE") and col("ELECTRIFICATION_RATIO") arguments four times, so I created the following UDF:

def associateEnergyReducedUdf(consumptionType: String)(consumedEnergyCol: Column) = udf(
    () => associateEnergyUdf(consumptionType)(col("MOTOR_TYPE"), consumedEnergyCol, col("ELECTRIFICATION_RATIO"))
  )

so that I only need to call:

inputDf
    .withColumn("ELEC_ENERGY_IN_US", associateEnergyReducedUdf("ELEC")(col("ENERGY_IN_US"))())
    .withColumn("ELEC_ENERGY_OUTSIDE_US", associateEnergyReducedUdf("ELEC")(col("ENERGY_OUTSIDE_US"))())
    .withColumn("THERM_ENERGY_IN_US", associateEnergyReducedUdf("THERM")(col("ENERGY_IN_US"))())
    .withColumn("THERM_ENERGY_OUTSIDE_US", associateEnergyReducedUdf("THERM")(col("ENERGY_OUTSIDE_US"))())

But this results in the following error:

java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported

How can I achieve this without repeating the unnecessary arguments?


[Answer 1]:

In your reduced UDF you are wrapping an already existing UDF inside another UDF. Try this instead:

def associateEnergyReducedUdf(consumptionType: String)(consumedEnergyCol: Column) =
      () => associateEnergyUdf(consumptionType)(col("MOTOR_TYPE"), consumedEnergyCol, col("ELECTRIFICATION_RATIO"))

This works.
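A closely related variant, offered as a sketch and not as the answerer's exact code: applying a UDF to columns already yields a `Column`, and `udf` cannot derive a schema for a lambda whose return type is `Column`, which is where the exception comes from. The helper can therefore be an ordinary method that returns the `Column` directly, which also drops the trailing `()` at the call site. The helper name `associateEnergyCol`, the local `SparkSession`, and the sample rows typed as `Float` are assumptions made for illustration; the snippet is written in spark-shell style.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

// Local session for illustration only (assumption; reuse your own session in practice).
val spark = SparkSession.builder().master("local[1]").appName("udf-column-sketch").getOrCreate()
import spark.implicits._

// Same logic as in the question, with braces added.
def associateEnergy(motorType: String, consumedEnergy: Float, consumptionType: String, elecRatio: Float): Float =
  motorType match {
    case "Electrical" => if (consumptionType == "ELEC") consumedEnergy else 0f
    case "Thermical"  => if (consumptionType == "THERM") consumedEnergy else 0f
    case "Hybrid"     => if (consumptionType == "ELEC") consumedEnergy * elecRatio else consumedEnergy * (1 - elecRatio)
  }

// The UDF from the question: consumptionType is fixed when the UDF is built.
def associateEnergyUdf(consumptionType: String) = udf(
  (motorType: String, consumedEnergy: Float, elecRatio: Float) =>
    associateEnergy(motorType, consumedEnergy, consumptionType, elecRatio)
)

// Hypothetical helper: a plain method, not a UDF. It only builds a Column expression
// by applying the existing UDF to the two fixed columns plus the energy column.
def associateEnergyCol(consumptionType: String)(consumedEnergyCol: Column): Column =
  associateEnergyUdf(consumptionType)(col("MOTOR_TYPE"), consumedEnergyCol, col("ELECTRIFICATION_RATIO"))

// Sample rows mirroring the question's table, typed as Float for simplicity.
val inputDf = Seq(
  ("Alex", "Electrical", 1.0f, 15f, 0f),
  ("Bob", "Thermical", 0.0f, 0f, 5f),
  ("Claire", "Hybrid", 0.5f, 0f, 10f)
).toDF("CAR_OWNER", "MOTOR_TYPE", "ELECTRIFICATION_RATIO", "ENERGY_IN_US", "ENERGY_OUTSIDE_US")

val result = inputDf
  .withColumn("ELEC_ENERGY_IN_US", associateEnergyCol("ELEC")(col("ENERGY_IN_US")))
  .withColumn("ELEC_ENERGY_OUTSIDE_US", associateEnergyCol("ELEC")(col("ENERGY_OUTSIDE_US")))
  .withColumn("THERM_ENERGY_IN_US", associateEnergyCol("THERM")(col("ENERGY_IN_US")))
  .withColumn("THERM_ENERGY_OUTSIDE_US", associateEnergyCol("THERM")(col("ENERGY_OUTSIDE_US")))
```

The design point is that only the innermost function needs to be a UDF; everything that merely combines `Column` objects can stay ordinary Scala code evaluated on the driver.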

