Convert long type column to calendarinterval type in Spark SQL

Posted: 2020-04-03 04:59:13

Question:

Two questions:

How can a long type column containing a number of seconds be converted to the calendarinterval type in Spark SQL, using Python?

How can the following code be converted into a pure Spark SQL query?

from pyspark.sql.functions import unix_timestamp

# difference between the two timestamp columns, in seconds
df2 = df.withColumn(
    "difference_duration",
    unix_timestamp("CAL_COMPLETION_TIME") - unix_timestamp("Prev_Time"),
)
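
In other words, the goal is something like the following sketch (hypothetical; it assumes df has been registered as a temp view named df, and uses the SQL-side unix_timestamp function):

df.createOrReplaceTempView("df")
# same seconds difference, expressed as a SQL query instead of withColumn
df2 = spark.sql("""
    SELECT *,
           unix_timestamp(CAL_COMPLETION_TIME) - unix_timestamp(Prev_Time) AS difference_duration
    FROM df
""")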

Sample dataframe (screenshot from the original post omitted):

Basically, I am trying to implement the following PGSQL query in Spark SQL:

case
    when t1.prev_time <> t1.prev_time_calc and t1."CAL_COMPLETION_TIME" - t1.prev_time < interval '30 min'
      then t1.next_time_calc - t1.prev_time_calc
    when (t1.next_time <> t1.next_time_calc and t1.next_time - t1."CAL_COMPLETION_TIME" < interval '30 min') or (t1.next_time - t1."CAL_COMPLETION_TIME" < interval '30 min')
      then t1.next_time_calc - t1."CAL_COMPLETION_TIME"
    else null
end min_diff

However, the part t1."CAL_COMPLETION_TIME" - t1.prev_time throws the following error:

AnalysisException: "cannot resolve '(t1.`CAL_COMPLETION_TIME` - t1.`prev_time`)' due to data type mismatch: '(t1.`CAL_COMPLETION_TIME` - t1.`prev_time`)' requires (numeric or calendarinterval) type, not timestamp;

Comments:

Could you please share some sample rows of the dataframe df? That would make the problem statement clearer.

Answer 1:

You cannot subtract timestamps directly; you need to convert them to seconds first. So what you are looking for is to cast the timestamp columns to long/bigint, subtract them, divide by 60 to get the value in minutes, and then check whether it is less than 30.

# example: df1
# both columns are of type Timestamp
+-------------------+-------------------+
|          prev_time|CAL_COMPLETION_TIME|
+-------------------+-------------------+
|2019-04-26 01:19:10|2019-04-26 01:19:35|
+-------------------+-------------------+

PySpark:

from pyspark.sql import functions as F

df1.withColumn("sub", F.when(
    (F.col("CAL_COMPLETION_TIME").cast("long") - F.col("prev_time").cast("long")) / 60 < 30,
    F.lit("LESSTHAN30")).otherwise(F.lit("GREATERTHAN"))).show()

+-------------------+-------------------+----------+
|          prev_time|CAL_COMPLETION_TIME|       sub|
+-------------------+-------------------+----------+
|2019-04-26 01:19:10|2019-04-26 01:19:35|LESSTHAN30|
+-------------------+-------------------+----------+

Spark SQL:

df1.createOrReplaceTempView("df1")
spark.sql("""
    SELECT prev_time, CAL_COMPLETION_TIME,
           IF(((CAST(CAL_COMPLETION_TIME AS bigint) - CAST(prev_time AS bigint)) / 60) < 30,
              'LESSTHAN30', 'GREATER') AS difference_duration
    FROM df1
""").show()

+-------------------+-------------------+-------------------+
|          prev_time|CAL_COMPLETION_TIME|difference_duration|
+-------------------+-------------------+-------------------+
|2019-04-26 01:19:10|2019-04-26 01:19:35|         LESSTHAN30|
+-------------------+-------------------+-------------------+
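
Putting the pieces together for the PGSQL CASE expression from the question: below is a sketch of the same logic as a single Spark SQL query, comparing seconds instead of intervals (30 minutes = 1800 seconds). It assumes the view also contains the prev_time_calc, next_time and next_time_calc columns referenced in the question, and min_diff comes out as a number of seconds (bigint), not as an interval:

df1.createOrReplaceTempView("t1")
# the redundant OR branch is kept as it appears in the original PGSQL query
spark.sql("""
    SELECT *,
           CASE
             WHEN prev_time <> prev_time_calc
                  AND CAST(CAL_COMPLETION_TIME AS bigint) - CAST(prev_time AS bigint) < 1800
               THEN CAST(next_time_calc AS bigint) - CAST(prev_time_calc AS bigint)
             WHEN (next_time <> next_time_calc
                  AND CAST(next_time AS bigint) - CAST(CAL_COMPLETION_TIME AS bigint) < 1800)
                  OR CAST(next_time AS bigint) - CAST(CAL_COMPLETION_TIME AS bigint) < 1800
               THEN CAST(next_time_calc AS bigint) - CAST(CAL_COMPLETION_TIME AS bigint)
             ELSE NULL
           END AS min_diff
    FROM t1
""").show()

As for the first question: Spark 2.x has no cast from long to calendarinterval, which is why the arithmetic above stays in seconds. On Spark 3.0+, the make_interval function can build an interval from a number of seconds, e.g.:

# hypothetical, Spark 3.0+ only: 1800 seconds as a proper interval
spark.sql("SELECT make_interval(0, 0, 0, 0, 0, 0, 1800) AS thirty_min").show()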

