Monotonically increasing ID based on column [duplicate]

Posted 2023-03-31

Original title: Monotonically increasing ID based on column [duplicate]
Originally posted: 2018-06-15 21:02:14
Question:

I am trying to add a new column to my Spark DataFrame. I understand that the following code can be used:

df.withColumn("row", monotonically_increasing_id())

But my use case is:

Input DF:

col_value
  1
  2
  2
  3
  3
  3

Output DF:

col_value      identifier
  1               1
  2               1
  2               2
  3               1
  3               2
  3               3

Any suggestions on how to get this using monotonically_increasing or rowWithUniqueIndex?

Answer 1:

Based on your requirements, one approach is to use the row_number window function:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// In spark-shell, spark.implicits._ is already imported, which toDF needs.
val df = Seq(
  1, 2, 2, 3, 3, 3
).toDF("col_value")

// Number the rows within each col_value group, starting at 1.
val window = Window.partitionBy("col_value").orderBy("col_value")
df.withColumn("identifier", row_number().over(window)).
  orderBy("col_value").
  show
// +---------+----------+
// |col_value|identifier|
// +---------+----------+
// |        1|         1|
// |        2|         1|
// |        2|         2|
// |        3|         1|
// |        3|         2|
// |        3|         3|
// +---------+----------+
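
The window above orders by the same column it partitions by, so the order of rows inside each group is arbitrary. That is harmless here because the rows in a group are identical, but if the DataFrame carries other columns, the numbering may differ between runs. A minimal sketch of a repeatable variant follows, assuming a second column named event_time exists to order by. That column, the sample values, and the object and method names are hypothetical and not part of the original question. It also assumes Spark is on the classpath and builds its own local SparkSession instead of relying on spark-shell.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object PerGroupIdentifier {
  // Adds a 1-based "identifier" within each col_value group.
  // Rows are ordered by the hypothetical "event_time" column so the
  // numbering is repeatable across runs.
  def addIdentifier(df: DataFrame): DataFrame = {
    val window = Window.partitionBy("col_value").orderBy(col("event_time"))
    df.withColumn("identifier", row_number().over(window))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("per-group-identifier")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data; event_time is an assumed tie-breaking column.
    val df = Seq(
      (1, "2018-06-15 09:00:00"),
      (2, "2018-06-15 09:05:00"),
      (2, "2018-06-15 09:01:00"),
      (3, "2018-06-15 09:30:00"),
      (3, "2018-06-15 09:10:00"),
      (3, "2018-06-15 09:20:00")
    ).toDF("col_value", "event_time")

    addIdentifier(df)
      .orderBy("col_value", "identifier")
      .show(false)

    spark.stop()
  }
}
```

Within each col_value group, the row with the earliest event_time gets identifier 1. monotonically_increasing_id is not used here because it produces IDs that are unique across the whole DataFrame rather than a counter that restarts for each group.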
