Newly created column shows null values in a PySpark dataframe
【Title】: newly created column shows null values in pyspark dataframe 【Posted】: 2020-10-14 20:21:26 【Question】: I want to add a column that computes the time difference between two timestamp values. To do that, I first add a column holding the current datetime, defined here as current_datetime:
import datetime
#define current datetime
now = datetime.datetime.now()
#Getting Current date and time
current_datetime=now.strftime("%Y-%m-%d %H:%M:%S")
print(now)
Then I want to add current_datetime to df as a column value and compute the difference:
import pyspark.sql.functions as F
productsDF = productsDF\
.withColumn('current_time', when(col('Quantity')>1, current_datetime))\
.withColumn('time_diff',\
(F.unix_timestamp(F.to_timestamp(F.col('current_time')))) -
(F.unix_timestamp(F.to_timestamp(F.col('Created_datetime'))))/F.lit(3600)
)
However, the output contains only null values.
productsDF.select('current_time','Created_datetime','time_diff').show()
+------------+-------------------+---------+
|current_time|   Created_datetime|time_diff|
+------------+-------------------+---------+
|        null|2019-10-12 17:09:18|     null|
|        null|2019-12-03 07:02:07|     null|
|        null|2020-01-16 23:10:08|     null|
|        null|2020-01-21 15:38:39|     null|
|        null|2020-01-21 15:14:55|     null|
The new columns are created with string and double types:
|-- current_time: string (nullable = true)
|-- diff: double (nullable = true)
|-- time_diff: double (nullable = true)
As a test, I also tried creating the column from string and literal values, but the output is always null. What am I missing?
【Answer 1】: To fill a column with current_datetime, you are missing the lit() function:
import datetime
import pyspark.sql.functions as F
current_datetime = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
productsDF = productsDF.withColumn("current_time", F.lit(current_datetime))
To compute the time difference between two timestamp columns, you can do this:
productsDF.withColumn('time_diff',(F.unix_timestamp('current_time') -
F.unix_timestamp('Created_datetime'))/3600).show()
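A note on the parentheses around the subtraction: in the question's expression the division by 3600 applies only to the second unix_timestamp term, because division binds tighter than subtraction. The plain-Python sketch below repeats the same arithmetic on one pair of timestamps (the values of the first row reported in the comments). The helper epoch_seconds is hypothetical and parses the strings as UTC, which is an assumption; Spark uses the session time zone, so exact values can differ by the zone offset.

```python
from datetime import datetime, timezone


def epoch_seconds(text):
    """Parse 'YYYY-MM-DD HH:MM:SS' as UTC and return seconds since the epoch."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc).timestamp()


current = epoch_seconds("2020-10-15 07:11:14")
created = epoch_seconds("2019-10-12 17:09:18")

# Question's form: only `created` is divided by 3600, so the result stays
# close to the raw epoch value of `current` (on the order of 1.6e9).
unparenthesized = current - created / 3600

# Answer's form: subtract first, then convert seconds to hours.
hours = (current - created) / 3600

print(unparenthesized)
print(round(hours, 2))
```

The first value is of the same order of magnitude as the 1.6023E9 figures in the comments, while the second is an hour count of a few thousand.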
Edit:
For the time difference in hours, days, months and years, you can use:
df.withColumn('time_diff_hours',(F.unix_timestamp('current_time') - F.unix_timestamp('Created_datetime'))/3600)\
.withColumn("time_diff_days", F.datediff(F.col("current_time"),F.col("Created_datetime")))\
.withColumn("time_diff_months", F.months_between(F.col("current_time"),F.col("Created_datetime")))\
.withColumn("time_diff_years", F.year(F.col("current_time")) - F.year(F.col("Created_datetime"))).show()
+-------------------+-------------------+------------------+--------------+----------------+---------------+
|   Created_datetime|       current_time|   time_diff_hours|time_diff_days|time_diff_months|time_diff_years|
+-------------------+-------------------+------------------+--------------+----------------+---------------+
|2019-10-12 17:09:18|2020-10-15 02:45:49|  8841.60861111111|           369|     12.07743093|              1|
|2019-12-03 07:02:07|2020-10-15 02:45:49|7602.7283333333335|           317|     10.38135529|              1|
|2020-01-16 23:10:08|2020-10-15 02:45:49| 6530.594722222222|           273|      8.94031549|              0|
+-------------------+-------------------+------------------+--------------+----------------+---------------+
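The day, month and year columns above are calendar-based rather than elapsed-time values. As a rough plain-Python sketch (not Spark code), the first row can be reproduced as follows. The months calculation reflects my understanding of the documented months_between rule for dates that fall on different days of the month and are not both month ends: whole months from the year and month fields, plus the remaining days and time as a fraction of a 31-day month, rounded to 8 decimals. The helper seconds_into_month is hypothetical.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
created = datetime.strptime("2019-10-12 17:09:18", FMT)
current = datetime.strptime("2020-10-15 02:45:49", FMT)

# Like datediff: difference between the calendar dates, time of day ignored.
calendar_days = (current.date() - created.date()).days

# Like year(a) - year(b): difference between the year numbers only.
calendar_years = current.year - created.year


def seconds_into_month(moment):
    """Seconds counted from day 0 of the month up to the given moment."""
    return (
        moment.day * 86400
        + moment.hour * 3600
        + moment.minute * 60
        + moment.second
    )


# Whole months from the year/month fields, then the leftover days and time
# expressed as a fraction of a 31-day month.
whole_months = (current.year - created.year) * 12 + (current.month - created.month)
fraction = (seconds_into_month(current) - seconds_into_month(created)) / (31 * 86400)
calendar_months = round(whole_months + fraction, 8)

print(calendar_days, calendar_months, calendar_years)
```

This is why the second row shows a year difference of 1 although less than a year has elapsed: only the year numbers 2020 and 2019 are compared.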
If you want the exact elapsed time difference instead:
df.withColumn('time_diff_hours',(F.unix_timestamp('current_time') - F.unix_timestamp('Created_datetime'))/3600)\
.withColumn('time_diff_days',(F.unix_timestamp('current_time') - F.unix_timestamp('Created_datetime'))/(3600*24))\
.withColumn('time_diff_years',(F.unix_timestamp('current_time') - F.unix_timestamp('Created_datetime'))/(3600*24*365)).show()
+-------------------+-------------------+------------------+------------------+------------------+
|   Created_datetime|       current_time|   time_diff_hours|    time_diff_days|   time_diff_years|
+-------------------+-------------------+------------------+------------------+------------------+
|2019-10-12 17:09:18|2020-10-15 02:45:49|  8841.60861111111| 368.4003587962963|1.0093160514967021|
|2019-12-03 07:02:07|2020-10-15 02:45:49|7602.7283333333335|316.78034722222225|0.8678913622526636|
|2020-01-16 23:10:08|2020-10-15 02:45:49| 6530.594722222222| 272.1081134259259|0.7455016806189751|
+-------------------+-------------------+------------------+------------------+------------------+
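As a quick sanity check, the elapsed-time figures of the first row can be recomputed in plain Python with timedelta. This is only a sketch: it uses naive datetimes, so it ignores time zones and daylight saving, which Spark's unix_timestamp does take into account through the session time zone. Note also that the years column assumes a fixed 365-day year; dividing by 365.25 is shown as one possible alternative, not as something the answer uses.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
created = datetime.strptime("2019-10-12 17:09:18", FMT)
current = datetime.strptime("2020-10-15 02:45:49", FMT)

# Elapsed wall-clock time between the two naive timestamps, in seconds.
elapsed_seconds = (current - created).total_seconds()

hours = elapsed_seconds / 3600
days = elapsed_seconds / (3600 * 24)
years_365 = elapsed_seconds / (3600 * 24 * 365)        # the answer's divisor
years_365_25 = elapsed_seconds / (3600 * 24 * 365.25)  # alternative divisor

print(hours, days, years_365, years_365_25)
```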
【Comments】:
Thanks, now I get the following output for time_diff, which is what I had achieved before (with messier code):
+-------------------+-------------------+--------------------+
|       current_time|   Created_datetime|           time_diff|
+-------------------+-------------------+--------------------+
|2020-10-15 07:11:14|2019-10-12 17:09:18|    1.602302314845E9|
|2020-10-15 07:11:14|2019-12-03 07:02:07|1.6023010759647222E9|
|2020-10-15 07:11:14|2020-01-16 23:10:08|1.6023000038311112E9|
Can I somehow create columns that show the time difference in years, months, days and so on?
Your code is fine. Just remove F.to_timestamp(), since according to the question those fields are already timestamps. I have added code to compute the time difference in hours, days, months and years.