How to filter a column in a data frame by the regex value of another column in the same data frame in PySpark
Posted: 2020-03-02 18:21:50
Question: I am trying to keep only the rows of a data frame in which one column matches the regex pattern given in another column of the same row.
df = sqlContext.createDataFrame([('what is the movie that features Tom Cruise','actor_movies','(movie|film).*(feature)|(in|on).*(movie|film)'),
('what is the movie that features Tom Cruise','artist_song','(who|what).*(sing|sang|perform)'),
('who is the singer for hotel califonia?','artist_song','(who|what).*(sing|sang|perform)')],
['query','question_type','regex_patt'])
+------------------------------------------+-------------+---------------------------------------------+
|query                                     |question_type|regex_patt                                   |
+------------------------------------------+-------------+---------------------------------------------+
|what is the movie that features Tom Cruise|actor_movies |(movie|film).*(feature)|(in|on).*(movie|film)|
|what is the movie that features Tom Cruise|artist_song  |(who|what).*(sing|sang|perform)              |
|who is the singer for hotel califonia?    |artist_song  |(who|what).*(sing|sang|perform)              |
+------------------------------------------+-------------+---------------------------------------------+
I want to prune the data frame so that only the rows whose query matches the value in the regex_patt column are kept. The final result should look like this:
+------------------------------------------+-------------+---------------------------------------------+
|query                                     |question_type|regex_patt                                   |
+------------------------------------------+-------------+---------------------------------------------+
|what is the movie that features Tom Cruise|actor_movies |(movie|film).*(feature)|(in|on).*(movie|film)|
|who is the singer for hotel califonia?    |artist_song  |(who|what).*(sing|sang|perform)              |
+------------------------------------------+-------------+---------------------------------------------+
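I was thinking of
df.filter(column('query').rlike('regex_patt'))
But rlike only accepts a regex string, so 'regex_patt' here is treated as the pattern itself rather than as a reference to the column.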
So the question is: how do I filter the "query" column based on the regex value in the "regex_patt" column?
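Answer 1: You can try this. Writing the call as a SQL expression inside expr lets you pass columns as both the string and the pattern arguments.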
from pyspark.sql import functions as F
df.withColumn("query1", F.expr("""regexp_extract(query, regex_patt)""")).filter(F.col("query1")!='').drop("query1").show(truncate=False)
+------------------------------------------+-------------+---------------------------------------------+
|query |question_type|regex_patt |
+------------------------------------------+-------------+---------------------------------------------+
|what is the movie that features Tom Cruise|actor_movies |(movie|film).*(feature)|(in|on).*(movie|film)|
|who is the singer for hotel califonia? |artist_song |(who|what).*(sing|sang|perform) |
+------------------------------------------+-------------+---------------------------------------------+
Comments:
Thank you very much. How can I add an alias inside expr()? Any sample, please?