Spark Hive - UDFArgumentTypeException with window function?

Posted 2023-03-31

Asked: 2016-07-20 08:05:32

Question:

I have the following df:

+------------+----------------------+-------------------+                                 
|increment_id|base_subtotal_incl_tax|          eventdate|                                 
+------------+----------------------+-------------------+                                 
|        1086|            14470.0000|2016-06-14 09:54:12|                                 
|        1086|            14470.0000|2016-06-14 09:54:12|                                 
|        1086|            14470.0000|2015-07-14 09:54:12|                                 
|        1086|            14470.0000|2015-07-14 09:54:12|                                 
|        1086|            14470.0000|2015-07-14 09:54:12|                                 
|        1086|            14470.0000|2015-07-14 09:54:12|                                 
|        1086|             1570.0000|2015-07-14 09:54:12|                                 
|        5555|            14470.0000|2014-07-14 09:54:12|                                 
|        5555|            14470.0000|2014-07-14 09:54:12|                                 
|        5555|            14470.0000|2014-07-14 09:54:12|                                 
|        5555|            14470.0000|2014-07-14 09:54:12|                                 
+------------+----------------------+-------------------+

I am trying to run a window function like this:

WindowSpec window = Window.partitionBy(df.col("id")).orderBy(df.col("eventdate").desc());
df.select(df.col("*"),rank().over(window).alias("rank")) //error for this line
         .filter("rank <= 2")
         .show();

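What I want is the last two entries for each user ("last" meaning the latest dates, which are the first two rows here because the window is ordered descending):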

+------------+----------------------+-------------------+                                 
|increment_id|base_subtotal_incl_tax|          eventdate|                                 
+------------+----------------------+-------------------+                                 
|        1086|            14470.0000|2016-06-14 09:54:12|                                 
|        1086|            14470.0000|2016-06-14 09:54:12|   
|        5555|            14470.0000|2014-07-14 09:54:12|                                 
|        5555|            14470.0000|2014-07-14 09:54:12|                                     
+------------+----------------------+-------------------+

But I get:

+------------+----------------------+-------------------+----+
|increment_id|base_subtotal_incl_tax|          eventdate|rank|                            
+------------+----------------------+-------------------+----+                            
|        5555|            14470.0000|2014-07-14 09:54:12|   1|                            
|        5555|            14470.0000|2014-07-14 09:54:12|   1|                            
|        5555|            14470.0000|2014-07-14 09:54:12|   1|                            
|        5555|            14470.0000|2014-07-14 09:54:12|   1|                            
|        1086|            14470.0000|2016-06-14 09:54:12|   1|                            
|        1086|            14470.0000|2016-06-14 09:54:12|   1|                            
+------------+----------------------+-------------------+----+

What am I missing?

[OLD] - Originally I had an error, which is now resolved:

WindowSpec window = Window.partitionBy(df.col("id"));
df.select(df.col("*"),rank().over(window).alias("rank")) //error for this line
         .filter("rank <= 2")
         .show();

However, for the line marked with the comment above, this throws: Exception in thread "main" org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: One or more arguments are expected. What am I missing? What does this error mean? Thanks!

Answer 1:

The rank window function needs a window with an orderBy clause, for example:

WindowSpec window = Window.partitionBy(df.col("id")).orderBy(df.col("payment"));

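rank is defined by each row's position in the window's ordering, so without an orderBy it is meaningless, hence the error.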
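
To illustrate why the ordering matters, and why the updated query in the question can still return more than two rows per group, here is a minimal sketch in plain Java with no Spark dependency. It only mimics the semantics of the SQL rank and row_number functions over one already-partitioned list of eventdate values. The class name WindowRankSketch and its helper methods are hypothetical and are not part of Spark or of the original post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only: mimics rank / row_number semantics
// over a single partition, without using Spark.
class WindowRankSketch {

    // Sort the partition's values in descending order, like orderBy(col.desc()).
    public static List<String> sortedDesc(List<String> values) {
        List<String> copy = new ArrayList<>(values);
        copy.sort(Comparator.reverseOrder());
        return copy;
    }

    // rank: tied values share a rank; the next distinct value gets its
    // 1-based position, so gaps appear after ties.
    public static List<Integer> rank(List<String> sortedValues) {
        List<Integer> ranks = new ArrayList<>();
        for (int i = 0; i < sortedValues.size(); i++) {
            if (i > 0 && sortedValues.get(i).equals(sortedValues.get(i - 1))) {
                ranks.add(ranks.get(i - 1));
            } else {
                ranks.add(i + 1);
            }
        }
        return ranks;
    }

    // row_number: every row gets a distinct 1-based position, ties or not.
    public static List<Integer> rowNumber(List<String> sortedValues) {
        List<Integer> numbers = new ArrayList<>();
        for (int i = 0; i < sortedValues.size(); i++) {
            numbers.add(i + 1);
        }
        return numbers;
    }

    public static void main(String[] args) {
        // eventdate values for increment_id 5555, copied from the question's df.
        List<String> p5555 = sortedDesc(Arrays.asList(
                "2014-07-14 09:54:12", "2014-07-14 09:54:12",
                "2014-07-14 09:54:12", "2014-07-14 09:54:12"));
        // eventdate values for increment_id 1086, copied from the question's df.
        List<String> p1086 = sortedDesc(Arrays.asList(
                "2016-06-14 09:54:12", "2016-06-14 09:54:12",
                "2015-07-14 09:54:12", "2015-07-14 09:54:12",
                "2015-07-14 09:54:12", "2015-07-14 09:54:12",
                "2015-07-14 09:54:12"));

        System.out.println(rank(p5555));      // [1, 1, 1, 1]
        System.out.println(rowNumber(p5555)); // [1, 2, 3, 4]
        System.out.println(rank(p1086));      // [1, 1, 3, 3, 3, 3, 3]
        System.out.println(rowNumber(p1086)); // [1, 2, 3, 4, 5, 6, 7]
    }
}
```

Because all four rows for 5555 share the same eventdate, they all get rank 1 and pass the filter rank <= 2, which matches the output shown in the question. If an arbitrary pick among tied rows is acceptable, one option in Spark would be to replace rank() with row_number() from org.apache.spark.sql.functions (named rowNumber() in Spark 1.4 and 1.5) while keeping the same window; this is a suggestion under that assumption, not something confirmed in the original thread.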

Comments:

Thanks! I will accept your answer, but I have updated my question. I would appreciate it if you could help me with that as well.
