Convert datetime to date on PySpark
Title: Convert datetime to date on PySpark | Posted: 2021-01-23 21:19:26

Question: I have a dataframe with two columns, "date" (dtype: string) and "modified" (dtype: bigint), as shown below:
+-------------------------------------+-------------+
| date| modified|
+-------------------------------------+-------------+
|Mon, 18 Dec 2017 22:52:37 +0000 (UTC)|1513637587000|
| Mon, 18 Dec 2017 22:52:23 +0000|1513637587000|
| Mon, 18 Dec 2017 22:52:03 +0000|1513637587000|
|Mon, 18 Dec 2017 22:51:43 +0000 (UTC)|1513637527000|
| Mon, 18 Dec 2017 22:51:31 +0000|1513637527000|
| Mon, 18 Dec 2017 22:51:38 +0000|1513637527000|
| Mon, 18 Dec 2017 22:51:09 +0000|1513637526000|
| Mon, 18 Dec 2017 22:50:55 +0000|1513637466000|
| Mon, 18 Dec 2017 22:50:35 +0000|1513637466000|
| Mon, 18 Dec 2017 17:49:35 -0500|1513637407000|
+-------------------------------------+-------------+
How can I extract YYYY-mm-dd (2017-12-18) from either of these two columns? I tried using unix_timestamp and to_timestamp, but neither worked: both produced null values.
Answer 1: You can use from_unixtime to convert the bigint unix timestamp into a timestamp type, then cast it to a date type:
import pyspark.sql.functions as F

# "modified" holds epoch milliseconds; divide by 1000 to get epoch seconds
df2 = df.withColumn('parsed_date', F.from_unixtime(F.col('modified')/1000).cast('date'))
df2.show()
+--------------------+-------------+-----------+
| date| modified|parsed_date|
+--------------------+-------------+-----------+
|Mon, 18 Dec 2017 ...|1513637587000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637587000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637587000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637527000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637527000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637527000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637526000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637466000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637466000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637407000| 2017-12-18|
+--------------------+-------------+-----------+
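The divide-by-1000 step can be cross-checked without Spark. This is a minimal plain-Python sketch (not part of the answer's code) showing that the bigint values, read as epoch milliseconds, land on the expected calendar date in UTC:

```python
from datetime import datetime, timezone

# One of the "modified" values from the table above, in epoch milliseconds.
epoch_ms = 1513637587000

# Dividing by 1000 gives epoch seconds, which is what timestamp
# conversions (here and in Spark's from_unixtime) expect.
dt = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
print(dt.date())  # 2017-12-18
```

If you skip the division and feed milliseconds where seconds are expected, the value is off by a factor of 1000, which is exactly why the original attempts returned null or nonsense dates.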
Answer 2: Many questions have already been posted about converting strings to dates in Spark (Convert pyspark string to date format, Convert date from String to Date format in Dataframes...).

You are getting null because the modified column is epoch time in milliseconds; you need to divide it by 1000 to get seconds before converting it to a timestamp:
from pyspark.sql import functions as F
df1 = df.withColumn(
"modified_as_date",
F.to_timestamp(F.col("modified") / 1000).cast("date")
).withColumn(
"date_as_date",
F.to_date("date", "EEE, dd MMM yyyy HH:mm:ss")
)
df1.show(truncate=False)
#+-------------------------------------+-------------+----------------+------------+
#|date |modified |modified_as_date|date_as_date|
#+-------------------------------------+-------------+----------------+------------+
#|Mon, 18 Dec 2017 22:52:37 +0000 (UTC)|1513637587000|2017-12-18 |2017-12-18 |
#|Mon, 18 Dec 2017 22:52:23 +0000 |1513637587000|2017-12-18 |2017-12-18 |
#|Mon, 18 Dec 2017 22:52:03 +0000 |1513637587000|2017-12-18 |2017-12-18 |
#|Mon, 18 Dec 2017 22:51:43 +0000 (UTC)|1513637527000|2017-12-18 |2017-12-18 |
#|Mon, 18 Dec 2017 22:51:31 +0000 |1513637527000|2017-12-18 |2017-12-18 |
#|Mon, 18 Dec 2017 22:51:38 +0000 |1513637527000|2017-12-18 |2017-12-18 |
#|Mon, 18 Dec 2017 22:51:09 +0000 |1513637526000|2017-12-18 |2017-12-18 |
#|Mon, 18 Dec 2017 22:50:55 +0000 |1513637466000|2017-12-18 |2017-12-18 |
#|Mon, 18 Dec 2017 22:50:35 +0000 |1513637466000|2017-12-18 |2017-12-18 |
#|Mon, 18 Dec 2017 17:49:35 -0500 |1513637407000|2017-12-18 |2017-12-18 |
#+-------------------------------------+-------------+----------------+------------+
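The "EEE, dd MMM yyyy HH:mm:ss" pattern used above matches the RFC 2822 date format, which can also be parsed with Python's standard library as a quick sanity check outside Spark. This sketch (my addition, not from the answer) parses the last row of the table, whose offset is -0500 rather than +0000:

```python
from email.utils import parsedate_to_datetime

# RFC 2822 date string from the last row of the "date" column.
s = "Mon, 18 Dec 2017 17:49:35 -0500"

# parsedate_to_datetime returns a timezone-aware datetime; .date()
# gives the calendar date in the string's own UTC offset.
dt = parsedate_to_datetime(s)
print(dt.date())  # 2017-12-18
```

Note that a timezone offset can shift the calendar date near midnight, so whether you want the date in the string's local offset or in UTC is worth deciding explicitly before dropping the time component.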