How to filter a column in a data frame by the regex value of another column in the same data frame in PySpark


【Title】: How to filter a column in a data frame by the regex value of another column in same data frame in Pyspark 【Posted】: 2020-03-02 18:21:50 【Question】:

I am trying to filter one column of a data frame so that it matches the regex pattern given in another column of the same row.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # the original post used the legacy sqlContext entry point
df = spark.createDataFrame([
    ('what is the movie that features Tom Cruise', 'actor_movies', '(movie|film).*(feature)|(in|on).*(movie|film)'),
    ('what is the movie that features Tom Cruise', 'artist_song', '(who|what).*(sing|sang|perform)'),
    ('who is the singer for hotel califonia?', 'artist_song', '(who|what).*(sing|sang|perform)'),
], ['query', 'question_type', 'regex_patt'])

+------------------------------------------+-------------+---------------------------------------------+
|query                                     |question_type|regex_patt                                   |
+------------------------------------------+-------------+---------------------------------------------+
|what is the movie that features Tom Cruise|actor_movies |(movie|film).*(feature)|(in|on).*(movie|film)|
|what is the movie that features Tom Cruise|artist_song  |(who|what).*(sing|sang|perform)              |
|who is the singer for hotel califonia?    |artist_song  |(who|what).*(sing|sang|perform)              |
+------------------------------------------+-------------+---------------------------------------------+

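I want to trim the data frame so that only the rows whose query matches the value in the regex_patt column are kept. The final result should look like this: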

+------------------------------------------+-------------+---------------------------------------------+
|query                                     |question_type|regex_patt                                   |
+------------------------------------------+-------------+---------------------------------------------+
|what is the movie that features Tom Cruise|actor_movies |(movie|film).*(feature)|(in|on).*(movie|film)|
|who is the singer for hotel califonia?    |artist_song  |(who|what).*(sing|sang|perform)              |
+------------------------------------------+-------------+---------------------------------------------+

I was thinking of something like:

df.filter(column('query').rlike('regex_patt'))

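But rlike only accepts a regex string, so 'regex_patt' is treated as the pattern itself rather than as a reference to the column.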
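The difference between the two behaviours can be mimicked outside Spark. Below is a minimal sketch in plain Python, assuming the `re` module as a stand-in for Spark's Java regex engine (the two dialects differ in places, so this is only an analogy). The names `rows`, `literal_hits`, and `per_row_hits` are illustrative and not part of the original post.

```python
import re

# The three sample rows from the question: (query, question_type, regex_patt)
rows = [
    ("what is the movie that features Tom Cruise", "actor_movies",
     "(movie|film).*(feature)|(in|on).*(movie|film)"),
    ("what is the movie that features Tom Cruise", "artist_song",
     "(who|what).*(sing|sang|perform)"),
    ("who is the singer for hotel califonia?", "artist_song",
     "(who|what).*(sing|sang|perform)"),
]

# Analogue of rlike('regex_patt'): the literal text "regex_patt" is used
# as the pattern for every row.
literal_hits = [r for r in rows if re.search("regex_patt", r[0])]

# What the question asks for: each row is tested against its own pattern.
per_row_hits = [r for r in rows if re.search(r[2], r[0])]

print(len(literal_hits))
print([(query, qtype) for query, qtype, _ in per_row_hits])
```

With the literal pattern no query matches, while the per-row test keeps the actor_movies row and the hotel califonia row.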

So the question is: how can I filter the "query" column by the regex value in the "regex_patt" column?


【Answer 1】:

You can try this. Inside expr, both the string and the pattern arguments can be columns.

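from pyspark.sql import functions as F
# regexp_extract (group index defaults to 1) yields '' when regex_patt does not match query
matched = df.withColumn("query1", F.expr("regexp_extract(query, regex_patt)"))
matched.filter(F.col("query1") != '').drop("query1").show(truncate=False)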

+------------------------------------------+-------------+---------------------------------------------+
|query                                     |question_type|regex_patt                                   |
+------------------------------------------+-------------+---------------------------------------------+
|what is the movie that features Tom Cruise|actor_movies |(movie|film).*(feature)|(in|on).*(movie|film)|
|who is the singer for hotel califonia?    |artist_song  |(who|what).*(sing|sang|perform)              |
+------------------------------------------+-------------+---------------------------------------------+

【Comments】:

Thanks a lot. How do I add an alias inside expr()? Any sample, please.
