pyspark - can't get quarter and week of year from date column
Posted: 2020-12-01 14:49:37

I have a PySpark dataframe that looks like this:
+--------+----------+---------+----------+-----------+--------------------+
|order_id|product_id|seller_id| date|pieces_sold| bill_raw_text|
+--------+----------+---------+----------+-----------+--------------------+
| 668| 886059| 3205|2015-01-14| 91|pbdbzvpqzqvtzxone...|
| 6608| 541277| 1917|2012-09-02| 44|cjucgejlqnmfpfcmg...|
| 12962| 613131| 2407|2016-08-26| 90|cgqhggsjmrgkrfevc...|
| 14223| 774215| 1196|2010-03-04| 46|btujmkfntccaewurg...|
| 15131| 769255| 1546|2018-11-28| 13|mrfsamfuhpgyfjgki...|
| 15625| 86357| 2455|2008-04-18| 50|wlwsliatrrywqjrih...|
| 18470| 26238| 295|2009-03-06| 86|zrfdpymzkgbgdwFwz...|
| 29883| 995036| 4596|2009-10-25| 86|oxcutwmqgmioaelsj...|
| 38428| 193694| 3826|2014-01-26| 82|yonksvwhrfqkytypr...|
| 41023| 949332| 4158|2014-09-03| 83|hubxhfdtxrqsfotdq...|
+--------+----------+---------+----------+-----------+--------------------+
I want to create two columns: one with the quarter of the year and one with the week of the year. This is what I did, following the documentation for weekofyear and quarter:
from pyspark.sql import functions as F
sales_table = sales_table.withColumn("week_year", F.date_format(F.to_date("date", "yyyy-mm-dd"),
F.weekofyear("d")))
sales_table = sales_table.withColumn("quarter", F.date_format(F.to_date("date", "yyyy-mm-dd"),
F.quarter("d")))
sales_table.show(10)
This is the error:
Column is not iterable
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 945, in date_format
return Column(sc._jvm.functions.date_format(_to_java_column(date), format))
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1296, in __call__
args_command, temp_args = self._build_args(*args)
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1260, in _build_args
(new_args, temp_args) = self._get_args(args)
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1247, in _get_args
temp_arg = converter.convert(arg, self.gateway_client)
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 510, in convert
for element in object:
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/column.py", line 353, in __iter__
raise TypeError("Column is not iterable")
TypeError: Column is not iterable
How can I create and append these two columns? Also, is there a better or more efficient way to create these columns than converting the date column to yyyy-mm-dd format each time, ideally creating both columns in one command?
Solution 1: You can use the functions directly on the string column date.
from pyspark.sql import functions as F
df = df.select(
    '*',
    F.weekofyear('date').alias('week_year'),
    F.quarter('date').alias('quarter')
)
df.show()
+--------+----------+---------+----------+-----------+--------------------+---------+-------+
|order_id|product_id|seller_id| date|pieces_sold| bill_raw_text|week_year|quarter|
+--------+----------+---------+----------+-----------+--------------------+---------+-------+
| 668| 886059| 3205|2015-01-14| 91|pbdbzvpqzqvtzxone...| 3| 1|
| 6608| 541277| 1917|2012-09-02| 44|cjucgejlqnmfpfcmg...| 35| 3|
| 12962| 613131| 2407|2016-08-26| 90|cgqhggsjmrgkrfevc...| 34| 3|
| 14223| 774215| 1196|2010-03-04| 46|btujmkfntccaewurg...| 9| 1|
| 15131| 769255| 1546|2018-11-28| 13|mrfsamfuhpgyfjgki...| 48| 4|
| 15625| 86357| 2455|2008-04-18| 50|wlwsliatrrywqjrih...| 16| 2|
| 18470| 26238| 295|2009-03-06| 86|zrfdpymzkgbgdwFwz...| 10| 1|
| 29883| 995036| 4596|2009-10-25| 86|oxcutwmqgmioaelsj...| 43| 4|
| 38428| 193694| 3826|2014-01-26| 82|yonksvwhrfqkytypr...| 4| 1|
| 41023| 949332| 4158|2014-09-03| 83|hubxhfdtxrqsfotdq...| 36| 3|
+--------+----------+---------+----------+-----------+--------------------+---------+-------+
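If the date column were instead a string in a non-ISO format, a to_date conversion would still be needed before calling these functions. A minimal sketch, assuming a hypothetical dd/MM/yyyy input:

from pyspark.sql import functions as F

# Hypothetical case: the date column holds strings like "14/01/2015"
# rather than the ISO "yyyy-MM-dd" shown above. Parse it once with
# to_date, then apply the same functions.
df = df.select(
    '*',
    F.weekofyear(F.to_date('date', 'dd/MM/yyyy')).alias('week_year'),
    F.quarter(F.to_date('date', 'dd/MM/yyyy')).alias('quarter')
)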
Solution 2: You don't have to use the date_format function here. Since date is already in yyyy-MM-dd format, use weekofyear and quarter directly on the date column.
Example:
df.show()
#+----------+
#| date|
#+----------+
#|2015-01-14|
#+----------+
from pyspark.sql import functions as F
df.withColumn("week_year", F.weekofyear(F.col("date"))).\
    withColumn("quarter", F.quarter(F.col("date"))).\
    show()
#+----------+---------+-------+
#| date|week_year|quarter|
#+----------+---------+-------+
#|2015-01-14| 3| 1|
#+----------+---------+-------+
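As a side note on the original error: date_format expects a literal pattern string as its second argument, not a Column, which is why passing F.weekofyear(...) into it raised "Column is not iterable". date_format goes the other way, rendering a date as a formatted string; a small illustrative sketch:

from pyspark.sql import functions as F

# date_format(column, pattern) renders a date/timestamp column as a string
# using a literal datetime pattern, e.g. the full month name here.
df.withColumn("month_name", F.date_format("date", "MMMM")).show()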