How to create a specific timeframe in Spark

Posted 2023-04-15


【Title】: How to Create specific timeframe in spark 【Posted】: 2019-08-24 16:13:51 【Question】:

I have tracker data in which we store the tracker number and the arrival timestamp.

+---------+-------------------+
|trackerno|              adate|
+---------+-------------------+
| 54046022|2019-03-01 18:00:00|
| 54030173|2019-03-01 17:45:00|
| 53451324|2019-03-01 17:50:00|
| 54002797|2019-03-01 18:30:00|
| 53471705|2019-03-01 17:59:00|

I want the data for the last 15 minutes of the hour, between 17:44:59 and 17:59:59. I am using a Spark application.

Expected output:

+---------+-------------------+
|trackerno|              adate|
+---------+-------------------+
| 54030173|2019-03-01 17:45:00|
| 53451324|2019-03-01 17:50:00|
| 53471705|2019-03-01 17:59:00|
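
The requirement can be restated as a predicate on the timestamp: keep a row when its minute-of-hour falls in the last quarter of the hour. The following is a plain-Python sketch of that reading, not Spark code; the helper name `in_last_quarter` and the hard-coded sample rows are illustrative assumptions.

```python
from datetime import datetime

# Sample rows copied from the question: (trackerno, adate)
rows = [
    (54046022, "2019-03-01 18:00:00"),
    (54030173, "2019-03-01 17:45:00"),
    (53451324, "2019-03-01 17:50:00"),
    (54002797, "2019-03-01 18:30:00"),
    (53471705, "2019-03-01 17:59:00"),
]


def in_last_quarter(adate: str) -> bool:
    """Hypothetical helper: True when the timestamp lies in minutes 45-59 of its hour."""
    ts = datetime.strptime(adate, "%Y-%m-%d %H:%M:%S")
    return ts.minute >= 45


# Keep only the rows that fall in the last quarter of their hour
selected = [row for row in rows if in_last_quarter(row[1])]
for trackerno, adate in selected:
    print(trackerno, adate)
```

This sketch treats the window as hh:45:00 through hh:59:59, so a row stamped exactly 17:44:59 (the lower bound quoted in the question) would be excluded.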

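【Comments】:

I want this for every hour (0-24). The 15-minute window stays fixed each time. I mean it always has a specific start and end time, and the 15-minute window can fall in any hour between 0 and 24.

Please elaborate.

Can you add what you have tried so far? This sounds like a homework question, and you have not shown any attempt.

v_df.distinct().withColumn("timestamp", to_timestamp(unix_timestamp(col("adate"))))
  .withColumn("Date", date_format(col("timestamp"), "yyyy-MM-dd"))
  .withColumn("time", date_format(col("timestamp"), "HH:mm:ss"))
  .withColumn("mydata", when(minute($"time").between(44, 59), 1).otherwise(0)).show()

【Answer 1】:

You can try something like this: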

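  import org.apache.spark.sql.functions.{lit, to_timestamp}
  import spark.implicits._ // assumes a SparkSession named `spark`, e.g. in spark-shell

  val df = Seq(
    (54046022, "2019-03-01 18:00:00"),
    (54030173, "2019-03-01 17:45:00"),
    (53451324, "2019-03-01 17:50:00"),
    (54002797, "2019-03-01 18:30:00"),
    (53471705, "2019-03-01 17:59:00")
  ).toDF("trackerno", "adate")

  // Parse the string column into a timestamp so the comparison is chronological
  val tsDF = df.withColumn("ts", to_timestamp($"adate"))

  // Filter on the inclusive window first, then keep only the original columns
  val result = tsDF
    .where($"ts" >= to_timestamp(lit("2019-03-01 17:44:59")) &&
      $"ts" <= to_timestamp(lit("2019-03-01 17:59:59")))
    .select($"trackerno", $"adate")

  result.show(false)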
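
Since the asker wants the same window for every hour, the two timestamp literals in the filter can be computed rather than typed by hand. Below is a minimal plain-Python sketch, assuming a hypothetical helper `last_quarter_bounds` and the hh:44:59 to hh:59:59 bounds quoted in the question; the resulting strings could then be passed to `lit(...)` in the filter.

```python
from datetime import datetime
from typing import Tuple

FMT = "%Y-%m-%d %H:%M:%S"


def last_quarter_bounds(day: str, hour: int) -> Tuple[str, str]:
    """Hypothetical helper: inclusive bounds hh:44:59 .. hh:59:59 for one hour of one day."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    base = datetime.strptime(day, "%Y-%m-%d")
    start = base.replace(hour=hour, minute=44, second=59)
    end = base.replace(hour=hour, minute=59, second=59)
    return start.strftime(FMT), end.strftime(FMT)


# Bounds for the 17:00 hour used in the question
start, end = last_quarter_bounds("2019-03-01", 17)
print(start, end)
```

Looping `hour` over 0 to 23 yields one pair of bounds per hour of the day.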

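【Comments】:

【Answer 2】:

Your question is not very clear, especially regarding how you would determine the start and end times of the 15-minute window. I am answering based on my limited understanding of it.

Create a window over a 15-minute timeframe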

from pyspark.sql.functions import window
grouped_window = df.groupBy(window("adate", "15 minutes"),"trackerno","adate").count()

This gives you a result like this.

+------------------------------------------+---------+-------------------+-----+
|window                                    |trackerno|adate              |count|
+------------------------------------------+---------+-------------------+-----+
|[2019-03-01 17:45:00, 2019-03-01 18:00:00]|53451324 |2019-03-01 17:50:00|1    |
|[2019-03-01 18:30:00, 2019-03-01 18:45:00]|54002797 |2019-03-01 18:30:00|1    |
|[2019-03-01 17:45:00, 2019-03-01 18:00:00]|53471705 |2019-03-01 17:59:00|1    |
|[2019-03-01 18:00:00, 2019-03-01 18:15:00]|54046022 |2019-03-01 18:00:00|1    |
|[2019-03-01 17:45:00, 2019-03-01 18:00:00]|54030173 |2019-03-01 17:45:00|1    |
+------------------------------------------+---------+-------------------+-----+
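
The `window` column in this output is a tumbling window: each timestamp is assigned to the 15-minute bucket whose start is the timestamp rounded down to a quarter-hour boundary, and the end of the bucket is exclusive, which is why 18:00:00 lands in the 18:00 to 18:15 bucket and not in the 17:45 to 18:00 one. The following plain-Python sketch illustrates that bucketing; `tumbling_window` is an illustrative name, not a Spark function, and the sketch ignores time zones and the optional start offset.

```python
from datetime import datetime, timedelta


def tumbling_window(ts: datetime, minutes: int = 15):
    """Return (start, end) of the bucket containing ts; start inclusive, end exclusive.

    Assumes the bucket width divides a day evenly, so buckets align with midnight.
    """
    width = timedelta(minutes=minutes)
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    # Number of whole buckets elapsed since midnight
    buckets = (ts - midnight) // width
    start = midnight + buckets * width
    return start, start + width


for text in ("2019-03-01 17:50:00", "2019-03-01 18:00:00"):
    ts = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    start, end = tumbling_window(ts)
    print(text, "->", start, end)
```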

from pyspark.sql import functions as f
from pyspark.sql import Window
w = Window.partitionBy('window')

grouped_window.select('adate', 'trackerno', f.count('count').over(w).alias('dupeCount')).sort('adate')\
    .where('dupeCount > 1')\
    .drop('dupeCount')\
    .show()

+-------------------+---------+
|              adate|trackerno|
+-------------------+---------+
|2019-03-01 17:45:00| 54030173|
|2019-03-01 17:50:00| 53451324|
|2019-03-01 17:59:00| 53471705|
+-------------------+---------+
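
The second step keeps only rows whose 15-minute window contains more than one row; it does not test for a particular time of day. A plain-Python sketch of that logic on the sample data follows, where `bucket_start` is an illustrative helper and not part of PySpark.

```python
from collections import Counter
from datetime import datetime

# Sample rows copied from the question: (trackerno, adate)
rows = [
    (54046022, "2019-03-01 18:00:00"),
    (54030173, "2019-03-01 17:45:00"),
    (53451324, "2019-03-01 17:50:00"),
    (54002797, "2019-03-01 18:30:00"),
    (53471705, "2019-03-01 17:59:00"),
]


def bucket_start(adate: str) -> datetime:
    """Hypothetical helper: round a timestamp down to its quarter-hour boundary."""
    ts = datetime.strptime(adate, "%Y-%m-%d %H:%M:%S")
    return ts.replace(minute=ts.minute - ts.minute % 15, second=0)


# Count rows per window, then keep rows from windows holding more than one row
counts = Counter(bucket_start(adate) for _, adate in rows)
kept = sorted(
    (adate, trackerno)
    for trackerno, adate in rows
    if counts[bucket_start(adate)] > 1
)
for adate, trackerno in kept:
    print(adate, trackerno)
```

With different data, for example two rows in the 18:00 bucket, those rows would be kept as well, so this approach matches the expected output only because of how the sample happens to be distributed.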

【Comments】:

【Answer 3】:

df.where(minute($"ts")>=45)

【Comments】:
