Difference between dates in PySpark SQL
Posted: 2018-11-15 23:22:09

Question: I need to calculate the difference between two timestamps. I know PySpark SQL supports DATEDIFF, but it only works at day granularity. I wrote a function that computes the difference in minutes, but I get no output. The code is as follows:
...
logRowsDF.createOrReplaceTempView("taxiTable")
#first way
spark.registerFunction("test", lambda x,y: ((dt.strptime(x, '%Y-%m-%d %H:%M:%S') - dt.strptime(y, '%Y-%m-%d %H:%M:%S')).days * 24 * 60) + ((dt.strptime(x, '%Y-%m-%d %H:%M:%S') - dt.strptime(y, '%Y-%m-%d %H:%M:%S')).seconds/60))
#second
spark.registerFunction("test", lambda x,y: countTime(x,y))
#third
diff = udf(countTime)
#trying to call that function that way
listIpsDF = spark.sql('SELECT diff(pickup,dropoff) AS TIME FROM taxiTable')
The function:
def countTime(time1, time2):
    fmt = '%Y-%m-%d %H:%M:%S'
    d1 = dt.strptime(time1, fmt)
    d2 = dt.strptime(time2, fmt)
    diff = d2 - d1
    diff_minutes = (diff.days * 24 * 60) + (diff.seconds/60)
    return str(diff_minutes)
It just doesn't work. Can you help me?
An example:
+-------------------+-------------------+
| pickup| dropoff|
+-------------------+-------------------+
|2018-01-01 00:21:05|2018-01-01 00:24:23|
|2018-01-01 00:44:55|2018-01-01 01:03:05|
| ... |
+-------------------+-------------------+
Expected output (in minutes):
+-------------------+
| datediff |
+-------------------+
| 3.3 |
| 18.166666666666668|
| ... |
+-------------------+
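For reference, countTime itself produces exactly these values when called directly in plain Python outside Spark, so the function's logic is not the issue; a minimal standalone check:

from datetime import datetime as dt

# standalone check on the two sample rows:
# 00:24:23 - 00:21:05 = 3 min 18 s = 3.3 minutes
print(countTime('2018-01-01 00:21:05', '2018-01-01 00:24:23'))  # '3.3'
print(countTime('2018-01-01 00:44:55', '2018-01-01 01:03:05'))  # '18.166666666666668'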
Comments:
Please add an example of your input and the expected output.

@cronoik Done, I added some examples.

Answer 1:

Actually, I'm not sure where your error is, because some of your sample code doesn't make sense: for example, you register a function named "test", but in the SQL statement you call a function diff that was never registered, which should produce an error message. In any case, please find a working version of your code below:
from pyspark.sql.functions import udf
from datetime import datetime as dt

l = [('2018-01-01 00:21:05', '2018-01-01 00:24:23'),
     ('2018-01-01 00:44:55', '2018-01-01 01:03:05')]
df = spark.createDataFrame(l, ['begin', 'end'])
df.registerTempTable('test')

def countTime(time1, time2):
    fmt = '%Y-%m-%d %H:%M:%S'
    d1 = dt.strptime(time1, fmt)
    d2 = dt.strptime(time2, fmt)
    diff = d2 - d1
    diff_minutes = (diff.days * 24 * 60) + (diff.seconds/60)
    return str(diff_minutes)

diff = udf(countTime)
sqlContext.registerFunction("diffSQL", lambda x, y: countTime(x, y))

print('column expression udf works')
df.withColumn('bla', diff(df.begin, df.end)).show()
print('sql udf works')
spark.sql('select diffSQL(begin,end) from test').show()
Example output:
column expression udf works
+-------------------+-------------------+------------------+
| begin| end| bla|
+-------------------+-------------------+------------------+
|2018-01-01 00:21:05|2018-01-01 00:24:23| 3.3|
|2018-01-01 00:44:55|2018-01-01 01:03:05|18.166666666666668|
+-------------------+-------------------+------------------+
sql udf works
+-------------------+
|diffSQL(begin, end)|
+-------------------+
| 3.3|
| 18.166666666666668|
+-------------------+
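For comparison, here is a minimal sketch without any UDF, assuming the same df and that begin/end are strings in Spark's default yyyy-MM-dd HH:mm:ss timestamp format; the built-in unix_timestamp function can compute the minute difference directly:

from pyspark.sql.functions import col, unix_timestamp

# minute difference via built-ins; no Python UDF overhead
df.withColumn('diff_minutes',
              (unix_timestamp(col('end')) - unix_timestamp(col('begin'))) / 60
             ).show()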
Comments:
Thank you very much! I forgot the SQLContext import and the sqlContext = SQLContext(sc) line. Thanks again :D
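On Spark 2.0 and later, the SQL UDF can also be registered straight from the SparkSession, which avoids the separate SQLContext entirely; a sketch, assuming a SparkSession named spark:

# Spark 2.0+ replacement for sqlContext.registerFunction
spark.udf.register("diffSQL", countTime)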