根据时间戳差异删除行 - 火花

Posted 2023-03-31

技术标签:

【中文标题】根据时间戳差异删除行 - 火花【英文标题】：Delete rows based on difference in timestamp- spark 【发布时间】：2020-12-16 06:49:28 【问题描述】：

我有一个包含两列（“time_stamp”和“ID”）的 spark 数据框。

示例数据框：

      **ID**                **time_stamp**
       1AB               2015-01-23 08:23:16
       1AB               2015-01-23 08:54:40
      25CD               2015-01-23 09:02:20
       1AB               2015-01-23 10:15:36
       1AB               2015-01-23 12:38:40

如果时间戳与第一次出现的时间差小于 3 小时（保留最先出现的 ID），我想删除重复的 ID（保留第一次出现），如果差值大于 3 小时，我想保留 ID。

预期输出：

      **ID**                **time_stamp**
       1AB               2015-01-23 08:23:16
      25CD               2015-01-23 09:02:20
       1AB               2015-01-23 12:38:40

谢谢

【问题讨论】：

为什么 12:38:40 是一个单独的条目？距离 10:15:36 不到 3 小时。 @mck 为清楚起见编辑了问题。 3 小时从第一次出现开始计算 【参考方案1】：

使用 spark-sql：

val df = spark.sql(""" with t1(
select  '1AB' c1, '2015-01-23 08:23:16' c2 union all 
select  '1AB' c1, '2015-01-23 08:54:40' c2 union all 
select  '25CD' c1, '2015-01-23 09:02:20' c2 union all 
select  '1AB' c1, '2015-01-23 10:15:36' c2 union all 
select  '1AB' c1, '2015-01-23 12:38:40' c2 
) select c1 id, c2 timestamp from t1

""")

df.createOrReplaceTempView("view1")

spark.sql(""" 
select id,timestamp from (
select id, timestamp, unix_timestamp(timestamp)-unix_timestamp(mn)  diff from 
(select id, timestamp, min(timestamp) over(partition by id) mn from view1 ) temp
) temp2 
where  diff=0 or diff > 3*60*60
""").show(false)

+----+-------------------+
|id  |timestamp          |
+----+-------------------+
|1AB |2015-01-23 08:23:16|
|1AB |2015-01-23 12:38:40|
|25CD|2015-01-23 09:02:20|
+----+-------------------+

【讨论】：

哪个会有更好的性能，spark sql 还是@mck 提供的答案？ sparksql 有一个优势，因为引擎可以更好地解析和分析。无论如何，最好使用 2 种方法进行大量测试，然后得出结论。【参考方案2】：

使用自组开始以来的时间除以 3 小时的商分配一个组。

import pyspark.sql.functions as F
from pyspark.sql.window import Window

df2 = df.withColumn(
    "grouping",
    (
        (F.col("time_stamp").cast("long") - F.first("time_stamp").over(Window.partitionBy("ID").orderBy("time_stamp")).cast("long")) / (3*3600)
    ).cast("int")
).withColumn(
    "rn",
    F.row_number().over(Window.partitionBy("ID", "grouping").orderBy("time_stamp")
)
).filter("rn = 1").drop("grouping","rn")

df2.show()
+----+-------------------+
|  ID|         time_stamp|
+----+-------------------+
| 1AB|2015-01-23 08:23:16|
| 1AB|2015-01-23 12:38:40|
|25CD|2015-01-23 09:02:20|
+----+-------------------+

【讨论】：

【参考方案3】：

Select t.id, t.timestamp, count(id), Timestampdiff(hour,timestamp, t. Prev_timestamp) 选择id, timestamp, LAG(timestamp, 1) over (partition by id order by id) Prev_timestamp 从表）吨有count(id)

1 和 timestampdiff (HOUR,timestamp, prev_timestamp) >3

【讨论】：

以上是关于根据时间戳差异删除行 - 火花的主要内容，如果未能解决你的问题，请参考以下文章