Finding the difference between two timestamps in PySpark SQL
Posted: 2018-08-08 16:08:16

【Question】: In the table structure below, you can see the column names.
cal_avg_latency = spark.sql("SELECT UnitType, ROUND(AVG(TIMESTAMP_DIFF(OnSceneDtTmTS, ReceivedDtTmTS, MINUTE)), 2) as latency, count(*) as total_count FROM `SFSC_Incident_Census_view` WHERE EXTRACT(DATE from ReceivedDtTmTS) == EXTRACT(DATE from OnSceneDtTmTS) GROUP BY UnitType ORDER BY latency ASC")
Error:
ParseException: mismatched input 'FROM' expecting <EOF> (line 1, pos 122)

== SQL ==
SELECT UnitType, ROUND(AVG(TIMESTAMP_DIFF(OnSceneDtTmTS, ReceivedDtTmTS, MINUTE)), 2) as latency, count(*) as total_count FROM SFSC_Incident_Census_view WHERE EXTRACT((DATE FROM ReceivedDtTmTS) == EXTRACT(DATE FROM OnSceneDtTmTS)) GROUP BY UnitType ORDER BY latency ASC
--------------------------------------------------------------------------------------------------------------------------^^^
The error points at the WHERE condition, but even my TIMESTAMP_DIFF function alone does not work:
cal_avg_latency = spark.sql("SELECT UnitType, ROUND(AVG(TIMESTAMP_DIFF(OnSceneDtTmTS, ReceivedDtTmTS, MINUTE)), 2) as latency, count(*) as total_count FROM SFSC_Incident_Census_view GROUP BY UnitType ORDER BY latency ASC")
Error:
AnalysisException: "Undefined function: 'TIMESTAMP_DIFF'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 27"
【Comments】:
【Answer 1】: The error message looks clear enough: Hive has no TIMESTAMP_DIFF function.

If your columns have already been cast to the timestamp type, you can subtract them directly. Otherwise, cast them explicitly and take the difference:
SELECT ROUND(AVG(MINUTE(CAST(OnSceneDtTmTS AS timestamp) - CAST(ReceivedDtTmTS AS timestamp))), 2) AS latency
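If interval subtraction is not supported in your Spark version, the same average can be computed from epoch seconds instead. This is a sketch using the built-in unix_timestamp function, assuming the columns parse with its default timestamp format:

```sql
SELECT UnitType,
       ROUND(AVG((unix_timestamp(OnSceneDtTmTS) - unix_timestamp(ReceivedDtTmTS)) / 60.0), 2) AS latency,
       COUNT(*) AS total_count
FROM SFSC_Incident_Census_view
GROUP BY UnitType
ORDER BY latency ASC
```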
【Discussion】:
%sql SELECT UnitType, ROUND(AVG(MINUTE(CAST(OnSceneDtTmTS AS timestamp) - CAST(ReceivedDtTmTS AS timestamp))), 2) AS latency FROM SFSC_Incident_Census_view GROUP BY UnitType
It gives me the following error: Error in SQL statement: AnalysisException: cannot resolve '(CAST(sfsc_incident_census_view.`OnSceneDtTmTS` AS TIMESTAMP)

【Answer 2】:
I solved this with a PySpark query instead.
from pyspark.sql import functions as F

timeFmt = "yyyy-MM-dd'T'HH:mm:ss.SSS"
timeDiff = (F.unix_timestamp('OnSceneDtTmTS', format=timeFmt)
            - F.unix_timestamp('ReceivedDtTmTS', format=timeFmt))
FSCDataFrameTsDF = FSCDataFrameTsDF.withColumn("Duration", timeDiff)
# Convert seconds to minutes and round for further use.
FSCDataFrameTsDF = FSCDataFrameTsDF.withColumn("Duration_minutes", F.round(FSCDataFrameTsDF.Duration / 60.0))
Output:
【Discussion】:
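The unix_timestamp subtraction above yields whole seconds, which are then divided by 60 and rounded. The same arithmetic can be illustrated with plain Python datetime objects (a standalone sketch with made-up timestamps, not Spark code):

```python
from datetime import datetime

# Same pattern as the Spark timeFmt above, in Python's strptime syntax.
fmt = "%Y-%m-%dT%H:%M:%S.%f"

received = datetime.strptime("2018-08-08T16:00:00.000", fmt)
on_scene = datetime.strptime("2018-08-08T16:08:00.000", fmt)

# Difference in seconds, then converted to rounded minutes.
duration_seconds = (on_scene - received).total_seconds()
duration_minutes = round(duration_seconds / 60.0)

print(duration_seconds)  # 480.0
print(duration_minutes)  # 8
```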