Finding the difference between two timestamps in PySpark SQL

Posted: 2018-08-08 16:08:16

Question:

You can see the column names in the table structure below.

cal_avg_latency = spark.sql("SELECT UnitType, ROUND(AVG(TIMESTAMP_DIFF(OnSceneDtTmTS, ReceivedDtTmTS, MINUTE)), 2) as latency, count(*) as total_count FROM `SFSC_Incident_Census_view` WHERE EXTRACT(DATE from ReceivedDtTmTS) == EXTRACT(DATE from OnSceneDtTmTS) GROUP BY UnitType ORDER BY latency ASC")

Error:

ParseException: mismatched input 'FROM' expecting <EOF>(line 1, pos 122)

== SQL ==
SELECT UnitType, ROUND(AVG(TIMESTAMP_DIFF(OnSceneDtTmTS, ReceivedDtTmTS, MINUTE)), 2) as latency, count(*) as total_count FROM SFSC_Incident_Census_view WHERE EXTRACT((DATE FROM ReceivedDtTmTS) == EXTRACT(DATE FROM OnSceneDtTmTS)) GROUP BY UnitType ORDER BY latency ASC
--------------------------------------------------------------------------------------------------------------------------^^^

The error points at the WHERE condition, but even my TIMESTAMP_DIFF call on its own does not work:

cal_avg_latency = spark.sql("SELECT UnitType, ROUND(AVG(TIMESTAMP_DIFF(OnSceneDtTmTS, ReceivedDtTmTS, MINUTE)), 2) as latency, count(*) as total_count FROM SFSC_Incident_Census_view  GROUP BY UnitType ORDER BY latency ASC")

Error:

AnalysisException: "Undefined function: 'TIMESTAMP_DIFF'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 27"


Answer 1:

The error message looks clear enough: Hive has no TIMESTAMP_DIFF function.

If your columns have already been cast to timestamp type, you can subtract them directly. Otherwise, cast them explicitly and take the difference:

SELECT ROUND(AVG(MINUTE(CAST(OnSceneDtTmTS AS timestamp) - CAST(ReceivedDtTmTS AS timestamp))), 2) AS latency

Comments:

Running %sql SELECT UnitType, ROUND(AVG(MINUTE(CAST(OnSceneDtTmTS AS timestamp) - CAST(ReceivedDtTmTS AS timestamp))), 2) AS latency FROM SFSC_Incident_Census_view GROUP BY UnitType gives me the following error: Error in SQL statement: AnalysisException: cannot resolve 'CAST(sfsc_incident_census_view.OnSceneDtTmTS AS TIMESTAMP)

Answer 2:

I solved it with a PySpark query instead.

from pyspark.sql import functions as F

# Parse both string columns with an explicit format and diff them in seconds.
timeFmt = "yyyy-MM-dd'T'HH:mm:ss.SSS"
timeDiff = (F.unix_timestamp('OnSceneDtTmTS', format=timeFmt)
            - F.unix_timestamp('ReceivedDtTmTS', format=timeFmt))
FSCDataFrameTsDF = FSCDataFrameTsDF.withColumn("Duration", timeDiff)
# Convert seconds to minutes and round for further use.
FSCDataFrameTsDF = FSCDataFrameTsDF.withColumn("Duration_minutes", F.round(FSCDataFrameTsDF.Duration / 60.0))

Output:
