SQL: Match timestamps with time-only parameter to group and count unique times across multiple days

Posted 2023-04-15


【Asked】: 2021-05-17 21:32:15

【Question】:

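Using SQL or PySpark, I want to count the unique times of day that occur in timestamps over a 2-month range. The goal is to see the frequency distribution of when rows are logged to the table. I already know that timestamps with a time of 00:00:00 make up a large share, but I want to know how large that share is compared with other times.

The query below groups and counts the most common datetimes, but I need to drop the date and keep only the time. Apparently this is not a common thing to do.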

select timestamp,
    count(*) as count
from table_name
where timestamp between '2021-01-01' and '2021-02-28'
group by 1
order by 2 desc

The SQL/PySpark runs against a Spark DB in a Zeppelin notebook.

The timestamps look like this: 2021-01-01 02:07:55

【Comments on the question】:

【Answer 1】:

Maybe something like this?

select 
  date_format(timestamp, "H m s") as dataTime,
  count(*) as count
from table_name
where timestamp between '2021-01-01' and '2021-02-28'
group by date_format(timestamp, "H m s") 
order by 2 desc

Naming a column with a reserved word (timestamp) is not a good idea.

From the Spark documentation.
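
For readers without a Spark session at hand, here is a minimal plain-Python sketch of the same idea: drop the date, group by time of day, and compute each time's share of all rows. The sample rows and all variable names are hypothetical and not from the original post. It also shows that unpadded fields, which is roughly what the "H m s" pattern produces, look different from the zero-padded "HH:mm:ss" form used in the stored timestamps.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample rows in the question's timestamp format (assumed data).
rows = [
    "2021-01-01 00:00:00",
    "2021-01-01 02:07:55",
    "2021-01-02 00:00:00",
    "2021-01-03 02:07:55",
    "2021-01-04 00:00:00",
    "2021-01-05 13:30:00",
]

parsed = [datetime.strptime(r, "%Y-%m-%d %H:%M:%S") for r in rows]

# Zero-padded time of day, the analogue of a "HH:mm:ss" pattern.
time_counts = Counter(ts.strftime("%H:%M:%S") for ts in parsed)

# Unpadded fields separated by spaces, roughly what "H m s" yields.
unpadded = [f"{ts.hour} {ts.minute} {ts.second}" for ts in parsed]

total = sum(time_counts.values())
# Most frequent time first, together with its share of all rows.
distribution = [(t, n, n / total) for t, n in time_counts.most_common()]

for t, n, share in distribution:
    print(f"{t}  count={n}  share={share:.1%}")
```

The share column is what answers the "how large is 00:00:00 compared with other times" part of the question. In Spark the same ratio could presumably be obtained by dividing each group's count by the total row count.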

【Discussion】:

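You're right about the reserved word; I changed it to datetime. To simplify this query, the group by can be replaced with group by 1, since it just repeats the select clause. The select can also be changed to match the format the timestamps are given in: date_format(timestamp, "HH:mm:ss")

【Answer 2】: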

Depending on the type of your timestamp column, you can extract hour, minute, and second if it is a TimestampType (using lpad to add leading zeros), or use regexp_extract if it is a StringType.

from pyspark.sql import functions as F

# if your ts column has TimestampType
(df
    .withColumn('ts', F.col('ts').cast('timestamp')) # my assumption ts is timestamp
    .withColumn('time_only', F.concat(
        F.lpad(F.hour('ts'), 2, '0'),
        F.lit(':'),
        F.lpad(F.minute('ts'), 2, '0'),
        F.lit(':'),
        F.lpad(F.second('ts'), 2, '0')
    ))
    .show()
)

# if your ts column is StringType
(df
    .withColumn('ts', F.col('ts').cast('string')) # my assumption ts is string
    .withColumn('time_only', F.regexp_extract('ts', r'\d{2}:\d{2}:\d{2}', 0))
    .show()
)

# +-------------------+---------+
# |                 ts|time_only|
# +-------------------+---------+
# |2019-01-15 03:00:00| 03:00:00|
# |2019-01-15 20:00:00| 20:00:00|
# |2019-01-15 19:00:00| 19:00:00|
# |2019-01-15 11:00:00| 11:00:00|
# +-------------------+---------+
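
A time pattern such as `\d{2}:\d{2}:\d{2}` can be sanity-checked outside Spark with Python's re module before it goes into regexp_extract. This is only a sketch: `extract_time` is a hypothetical helper, and it assumes the pattern behaves the same in Python and in Spark's Java regex engine, which should hold for something this simple. Without the braces, `\d2` would mean "a digit followed by a literal 2" and would not match these timestamps.

```python
import re

# Two digits, colon, two digits, colon, two digits.
TIME_PATTERN = re.compile(r"\d{2}:\d{2}:\d{2}")


def extract_time(ts: str) -> str:
    """Return the first HH:mm:ss-like substring, or '' if there is none.

    Returning '' on no match mirrors what regexp_extract does in Spark.
    """
    match = TIME_PATTERN.search(ts)
    return match.group(0) if match else ""


# A brace-less pattern demands a literal "2" after each digit, so it fails here.
broken = re.search(r"\d2:\d2:\d2", "2019-01-15 03:00:00")

print(extract_time("2019-01-15 03:00:00"))
print(broken)
```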

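【Discussion】:

My timestamp is a TimestampType, but thanks for covering the case where it isn't. I added this to group and sort: df.groupBy(F.col('time_only')).count().orderBy(F.col('count').desc()).show() I also like that you wrap multi-line statements in parentheses instead of putting \ at the end of each line.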
