pyspark - can't get quarter and week of year from date column

Posted: 2020-12-01 14:49:37

Question:

I have a PySpark dataframe that looks like this:

+--------+----------+---------+----------+-----------+--------------------+
|order_id|product_id|seller_id|      date|pieces_sold|       bill_raw_text|
+--------+----------+---------+----------+-----------+--------------------+
|     668|    886059|     3205|2015-01-14|         91|pbdbzvpqzqvtzxone...|
|    6608|    541277|     1917|2012-09-02|         44|cjucgejlqnmfpfcmg...|
|   12962|    613131|     2407|2016-08-26|         90|cgqhggsjmrgkrfevc...|
|   14223|    774215|     1196|2010-03-04|         46|btujmkfntccaewurg...|
|   15131|    769255|     1546|2018-11-28|         13|mrfsamfuhpgyfjgki...|
|   15625|     86357|     2455|2008-04-18|         50|wlwsliatrrywqjrih...|
|   18470|     26238|      295|2009-03-06|         86|zrfdpymzkgbgdwFwz...|
|   29883|    995036|     4596|2009-10-25|         86|oxcutwmqgmioaelsj...|
|   38428|    193694|     3826|2014-01-26|         82|yonksvwhrfqkytypr...|
|   41023|    949332|     4158|2014-09-03|         83|hubxhfdtxrqsfotdq...|
+--------+----------+---------+----------+-----------+--------------------+

I want to create two columns: one with the quarter of the year and the other with the week of the year. This is what I did, referring to the documentation for weekofyear and quarter:

from pyspark.sql import functions as F
sales_table = sales_table.withColumn("week_year", F.date_format(F.to_date("date", "yyyy-mm-dd"),
                                                                F.weekofyear("d")))
sales_table = sales_table.withColumn("quarter", F.date_format(F.to_date("date", "yyyy-mm-dd"),
                                                              F.quarter("d")))
sales_table.show(10)

This is the error:

Column is not iterable
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 945, in date_format
    return Column(sc._jvm.functions.date_format(_to_java_column(date), format))
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1296, in __call__
    args_command, temp_args = self._build_args(*args)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1260, in _build_args
    (new_args, temp_args) = self._get_args(args)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1247, in _get_args
    temp_arg = converter.convert(arg, self.gateway_client)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 510, in convert
    for element in object:
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/column.py", line 353, in __iter__
    raise TypeError("Column is not iterable")
TypeError: Column is not iterable

How can I create and append these two columns?

Is there a better or more efficient way to create these columns, rather than having to convert the date column to yyyy-mm-dd format every time, and can both columns be created in a single command?

Comments:

Answer 1:

You can use the functions directly on the string column date:

from pyspark.sql import functions as F

df = df.select(
    '*',
    F.weekofyear('date').alias('week_year'), 
    F.quarter('date').alias('quarter')
)
df.show()

+--------+----------+---------+----------+-----------+--------------------+---------+-------+
|order_id|product_id|seller_id|      date|pieces_sold|       bill_raw_text|week_year|quarter|
+--------+----------+---------+----------+-----------+--------------------+---------+-------+
|     668|    886059|     3205|2015-01-14|         91|pbdbzvpqzqvtzxone...|        3|      1|
|    6608|    541277|     1917|2012-09-02|         44|cjucgejlqnmfpfcmg...|       35|      3|
|   12962|    613131|     2407|2016-08-26|         90|cgqhggsjmrgkrfevc...|       34|      3|
|   14223|    774215|     1196|2010-03-04|         46|btujmkfntccaewurg...|        9|      1|
|   15131|    769255|     1546|2018-11-28|         13|mrfsamfuhpgyfjgki...|       48|      4|
|   15625|     86357|     2455|2008-04-18|         50|wlwsliatrrywqjrih...|       16|      2|
|   18470|     26238|      295|2009-03-06|         86|zrfdpymzkgbgdwFwz...|       10|      1|
|   29883|    995036|     4596|2009-10-25|         86|oxcutwmqgmioaelsj...|       43|      4|
|   38428|    193694|     3826|2014-01-26|         82|yonksvwhrfqkytypr...|        4|      1|
|   41023|    949332|     4158|2014-09-03|         83|hubxhfdtxrqsfotdq...|       36|      3|
+--------+----------+---------+----------+-----------+--------------------+---------+-------+
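For what it's worth, here is a minimal sketch of what you would do if the column were not already in a format Spark parses implicitly (the raw_date column and the dd/MM/yyyy format are hypothetical, not from the question): parse the string to a DateType column first, then extract the week and quarter from it.

from pyspark.sql import functions as F

# Hypothetical: a string column in a non-ISO format is parsed to a DateType
# column first; weekofyear/quarter are then applied to the parsed column.
df = (df
      .withColumn("parsed_date", F.to_date("raw_date", "dd/MM/yyyy"))
      .withColumn("week_year", F.weekofyear("parsed_date"))
      .withColumn("quarter", F.quarter("parsed_date")))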

Comments:

Answer 2:

You don't have to use the date_format function here. Since date is already in yyyy-MM-dd format, you can use weekofyear and quarter directly on the date column.

Example:

df.show()
#+----------+
#|      date|
#+----------+
#|2015-01-14|
#+----------+
from pyspark.sql import functions as F

df.withColumn("week_year", F.weekofyear(F.col("date"))).\
withColumn("quarter", F.quarter(F.col("date"))).\
show()
#+----------+---------+-------+
#|      date|week_year|quarter|
#+----------+---------+-------+
#|2015-01-14|        3|      1|
#+----------+---------+-------+
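As a side note (my reading of the traceback, not something stated in either answer): date_format expects a literal format string as its second argument, so passing a Column such as F.weekofyear("d") is what raises the "Column is not iterable" TypeError. A minimal sketch of date_format used correctly, for contrast (the year_month column is just an illustration):

from pyspark.sql import functions as F

# The second argument to date_format must be a plain Python format string,
# not a Column expression.
df = df.withColumn("year_month", F.date_format(F.col("date"), "yyyy-MM"))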

Comments:
