Select the nth row after orderBy in a PySpark DataFrame


【Title】: Select nth row after orderby in pyspark dataframe 【Posted】: 2020-08-11 12:22:30 【Question】:

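I want to select the second row for each name group. I use orderBy to sort by name and then by purchase date/timestamp. What matters is that I select the second purchase (by datetime) for each name.

Here is the data used to build the DataFrame: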

from datetime import datetime

data = [
  ('George', datetime(2020, 3, 24, 3, 19, 58), datetime(2018, 2, 24, 3, 22, 55)),
  ('Andrew', datetime(2019, 12, 12, 17, 21, 30), datetime(2019, 7, 21, 2, 14, 22)),
  ('Micheal', datetime(2018, 11, 22, 13, 29, 40), datetime(2018, 5, 17, 8, 10, 19)),
  ('Maggie', datetime(2019, 2, 8, 3, 31, 23), datetime(2019, 5, 19, 6, 11, 33)),
  ('Ravi', datetime(2019, 1, 1, 4, 19, 47), datetime(2019, 1, 1, 4, 22, 55)),
  ('Xien', datetime(2020, 3, 2, 4, 33, 51), datetime(2020, 5, 21, 7, 11, 50)),
  ('George', datetime(2020, 3, 24, 3, 19, 58), datetime(2020, 3, 24, 3, 22, 45)),
  ('Andrew', datetime(2019, 12, 12, 17, 21, 30), datetime(2019, 9, 19, 1, 14, 11)),
  ('Micheal', datetime(2018, 11, 22, 13, 29, 40), datetime(2018, 8, 19, 7, 11, 37)),
  ('Maggie', datetime(2019, 2, 8, 3, 31, 23), datetime(2018, 2, 19, 6, 11, 42)),
  ('Ravi', datetime(2019, 1, 1, 4, 19, 47), datetime(2019, 1, 1, 4, 22, 17)),
  ('Xien', datetime(2020, 3, 2, 4, 33, 51), datetime(2020, 6, 21, 7, 11, 11)),
  ('George', datetime(2020, 3, 24, 3, 19, 58), datetime(2020, 4, 24, 3, 22, 54)),
  ('Andrew', datetime(2019, 12, 12, 17, 21, 30), datetime(2019, 8, 30, 3, 12, 41)),
  ('Micheal', datetime(2018, 11, 22, 13, 29, 40), datetime(2017, 5, 17, 8, 10, 38)),
  ('Maggie', datetime(2019, 2, 8, 3, 31, 23), datetime(2020, 3, 19, 6, 11, 12)),
  ('Ravi', datetime(2019, 1, 1, 4, 19, 47), datetime(2018, 2, 1, 4, 22, 24)),
  ('Xien', datetime(2020, 3, 2, 4, 33, 51), datetime(2018, 9, 21, 7, 11, 41)),
]
 
df = sqlContext.createDataFrame(data, ['name', 'trial_start', 'purchase'])
df.show(truncate=False)

I order the data by name, then by purchase:

df.orderBy("name","purchase").show()

which produces:

+-------+-------------------+-------------------+
|   name|        trial_start|           purchase|
+-------+-------------------+-------------------+
| Andrew|2019-12-12 22:21:30|2019-07-21 06:14:22|
| Andrew|2019-12-12 22:21:30|2019-08-30 07:12:41|
| Andrew|2019-12-12 22:21:30|2019-09-19 05:14:11|
| George|2020-03-24 07:19:58|2018-02-24 08:22:55|
| George|2020-03-24 07:19:58|2020-03-24 07:22:45|
| George|2020-03-24 07:19:58|2020-04-24 07:22:54|
| Maggie|2019-02-08 08:31:23|2018-02-19 11:11:42|
| Maggie|2019-02-08 08:31:23|2019-05-19 10:11:33|
| Maggie|2019-02-08 08:31:23|2020-03-19 10:11:12|
|Micheal|2018-11-22 18:29:40|2017-05-17 12:10:38|
|Micheal|2018-11-22 18:29:40|2018-05-17 12:10:19|
|Micheal|2018-11-22 18:29:40|2018-08-19 11:11:37|
|   Ravi|2019-01-01 09:19:47|2018-02-01 09:22:24|
|   Ravi|2019-01-01 09:19:47|2019-01-01 09:22:17|
|   Ravi|2019-01-01 09:19:47|2019-01-01 09:22:55|
|   Xien|2020-03-02 09:33:51|2018-09-21 11:11:41|
|   Xien|2020-03-02 09:33:51|2020-05-21 11:11:50|
|   Xien|2020-03-02 09:33:51|2020-06-21 11:11:11|
+-------+-------------------+-------------------+

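How can I get the second row for each name? In pandas this is easy: I can just use nth. I have been looking at SQL but have not found a solution. Any suggestions are appreciated.

The output I am looking for is: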
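For comparison, the pandas approach mentioned in the question might look like the sketch below. It uses only two of the names from the question's data, and the variable names `pdf` and `second` are illustrative, not part of the original post.

```python
from datetime import datetime

import pandas as pd

# Subset of the question's data: two names, three purchases each.
rows = [
    ("George", datetime(2018, 2, 24, 3, 22, 55)),
    ("Andrew", datetime(2019, 7, 21, 2, 14, 22)),
    ("George", datetime(2020, 3, 24, 3, 22, 45)),
    ("Andrew", datetime(2019, 9, 19, 1, 14, 11)),
    ("George", datetime(2020, 4, 24, 3, 22, 54)),
    ("Andrew", datetime(2019, 8, 30, 3, 12, 41)),
]
pdf = pd.DataFrame(rows, columns=["name", "purchase"])

# nth is zero-based, so nth(1) picks the second row of each group
# once the rows are sorted by purchase within name.
second = pdf.sort_values(["name", "purchase"]).groupby("name").nth(1)
print(second)
```

Whether `name` comes back as the index or as a regular column depends on the pandas version, so it is safer to read the result through the `purchase` column.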

+-------+-------------------+-------------------+
|   name|        trial_start|           purchase|
+-------+-------------------+-------------------+
| Andrew|2019-12-12 22:21:30|2019-08-30 07:12:41|
| George|2020-03-24 07:19:58|2020-03-24 07:22:45|
| Maggie|2019-02-08 08:31:23|2019-05-19 10:11:33|
|Micheal|2018-11-22 18:29:40|2018-05-17 12:10:19|
|   Ravi|2019-01-01 09:19:47|2019-01-01 09:22:17|
|   Xien|2020-03-02 09:33:51|2020-05-21 11:11:50|
+-------+-------------------+-------------------+

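【Comments on the question】:

【Answer 1】:

Try the window function row_number(): number the rows within each name ordered by purchase, then filter to keep only the row numbered 2.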

Example:

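from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Number the rows within each name, ordered by purchase time.
w = Window.partitionBy("name").orderBy(col("purchase"))

# Keep only the second row of each name, then drop the helper column.
df.withColumn("rn", row_number().over(w)).filter(col("rn") == 2).drop(*["rn"]).show()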
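To make the semantics concrete, here is a plain-Python sketch of what "partition by name, order by purchase, keep row number n" computes. It is not how Spark executes the query; `nth_per_group` is a hypothetical helper written for this illustration, and the data is a two-name subset of the question's rows.

```python
from datetime import datetime
from itertools import groupby
from operator import itemgetter

rows = [
    ("George", datetime(2018, 2, 24, 3, 22, 55)),
    ("Andrew", datetime(2019, 7, 21, 2, 14, 22)),
    ("George", datetime(2020, 3, 24, 3, 22, 45)),
    ("Andrew", datetime(2019, 9, 19, 1, 14, 11)),
    ("George", datetime(2020, 4, 24, 3, 22, 54)),
    ("Andrew", datetime(2019, 8, 30, 3, 12, 41)),
]


def nth_per_group(records, n):
    """Return the n-th record (1-based) of each name, ordered by purchase."""
    # Sort by the partition key first, then by the ordering key.
    ordered = sorted(records, key=itemgetter(0, 1))
    picked = []
    for _, group in groupby(ordered, key=itemgetter(0)):
        # rn plays the role of row_number() within one partition.
        for rn, record in enumerate(group, start=1):
            if rn == n:
                picked.append(record)
                break
    return picked


second = nth_per_group(rows, 2)
for name, purchase in second:
    print(name, purchase)
```

As with the filter on `rn`, a name that has fewer than n rows simply contributes nothing to the result.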

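SQL API:

df.createOrReplaceTempView("tmp")

# Let backtick-quoted identifiers in the select list be read as regular expressions over column names.
spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")

# `(rn)?+.+` selects every column except rn.
spark.sql("select `(rn)?+.+` from (select *, row_number() over (partition by name order by purchase) rn from tmp) e where rn = 2").show()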
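The backtick pattern `(rn)?+.+` can look cryptic. The possessive `(rn)?+` consumes a leading "rn" without giving it back, and `.+` then requires at least one more character, so the only name that fails to match is exactly `rn`. The sketch below reproduces that effect with Python's `re` module. It assumes the pattern has to match the whole column name, and it swaps in a negative lookahead because possessive quantifiers are only available in Python 3.11 and later.

```python
import re

# Stand-in for Spark's `(rn)?+.+`: any non-empty name except exactly "rn".
# (Assumption: the pattern must match the entire column name.)
keep = re.compile(r"(?!rn$).+")

columns = ["name", "trial_start", "purchase", "rn"]
selected = [c for c in columns if keep.fullmatch(c)]
print(selected)
```

Listing the wanted columns explicitly in the select list avoids the regular expression and the extra configuration setting altogether.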

【Comments】:

Thanks. I got your SQL working. I also did it another way:

sqlContext.sql("select name, trial_start, purchase \
  from \
  ( \
    select name, trial_start, purchase, \
    row_number() over(partition by name order by purchase asc) as rn \
    from table \
  ) x \
  where x.rn = 2;").show()

I would like to get your Window approach working too. I get an error on col that I have not figured out. Any ideas?

from pyspark.sql import *
from pyspark.sql.functions import *
w=Window.partitionBy("name").orderBy(col("purchase"))
df.withColumn("rn",row_number().over(w)).filter(col("rn") ==2).drop(["rn"]).show()

TypeError                                 Traceback (most recent call last)
      3
      4 w=Window.partitionBy("name").orderBy(col("purchase"))
----> 5 df.withColumn("rn",row_number().over(w)).filter(col("rn") ==2).drop(["rn"]).show()

~/spark/spark-3.0.0-bin-hadoop2.7/python/pyspark/sql/dataframe.py in drop(self, *cols)
   2141                 jdf = self._jdf.drop(col._jc)
   2142             else:
-> 2143                 raise TypeError("col should be a string or a Column")
   2144         else:
   2145             for col in cols:

TypeError: col should be a string or a Column

Try .drop(*["rn"]). The list passed to drop has to be unpacked with *.

I copied the answer incorrectly..!

You can test all of your regular expressions at regex101.com

This would be a good start: medium.com/factory-mind/…
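The error in this thread comes down to how Python passes arguments to a varargs function. The sketch below uses a hypothetical stand-in named `drop`, not PySpark's actual implementation, to show the difference between passing a list and unpacking it; the error message is copied from the traceback quoted in the comments.

```python
def drop(*cols):
    """Hypothetical stand-in for a varargs method such as DataFrame.drop."""
    for c in cols:
        if not isinstance(c, str):
            raise TypeError("col should be a string or a Column")
    return list(cols)


to_drop = ["rn"]

try:
    drop(to_drop)  # one positional argument: the list itself
except TypeError as exc:
    error = str(exc)
    print("failed:", error)

unpacked = drop(*to_drop)  # the * spreads the list, same as drop("rn")
print("dropped:", unpacked)
```

When only one column is involved, writing `.drop("rn")` directly is the simplest form.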
