How to perform a Spark join if any (not all) conditions are met

Posted 2023-04-15

Published: 2021-02-26 21:09:49

Question:

The pyspark documentation states that a join can be performed as follows:

cond = [df.name == df3.name, df.age == df3.age]
df.join(df3, cond, 'outer').select(df.name, df3.age).collect()

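This joins the rows where both the name and the age columns match. I am trying to perform the same join, but on the condition that the name or the age column matches.

I tried: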

df.join(import_df, df.col1 == import_df.colA | df.col2 == import_df.colB , how="left")

But this gives me an error:

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

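Comments:

Put the conditions in parentheses. The "|" operator has higher precedence than "==".

Answer 1:

Wrap each join condition in parentheses () and then combine them with the or operator |.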

df.join(import_df, (df.col1 == import_df.colA) | (df.col2 == import_df.colB) , "left")

Using a cond variable:

cond=[(df.col1 == import_df.colA) | (df.col2 == import_df.colB)]
df.join(import_df, cond, "left").show()
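
To see why the parentheses matter, here is a minimal pure-Python sketch that does not need Spark. FakeColumn is a hypothetical stand-in written for this illustration, not a PySpark class; it only mimics the way a column overloads == and | and refuses conversion to bool. Because | binds tighter than ==, Python reads the unparenthesized condition as the chained comparison col1 == (colA | col2) == colB, and evaluating a chained comparison calls bool() on the intermediate result, which is where the ValueError comes from.

```python
import ast


def _expr(value):
    # Render either a FakeColumn or a plain Python value as text.
    return value.expr if isinstance(value, FakeColumn) else repr(value)


class FakeColumn:
    """Hypothetical stand-in for a Spark column (not part of PySpark)."""

    def __init__(self, expr):
        self.expr = expr

    def __eq__(self, other):
        # Build a new expression instead of returning True/False.
        return FakeColumn(f"({self.expr} = {_expr(other)})")

    def __or__(self, other):
        return FakeColumn(f"({self.expr} OR {_expr(other)})")

    def __bool__(self):
        # An expression has no single truth value.
        raise ValueError("Cannot convert column into bool")

    def __repr__(self):
        return self.expr


col1, colA, col2, colB = (FakeColumn(n) for n in ("col1", "colA", "col2", "colB"))

# How Python parses the condition without parentheses:
# one chained comparison whose middle operand is (colA | col2).
tree = ast.parse("col1 == colA | col2 == colB", mode="eval").body
parsed_as_chain = isinstance(tree, ast.Compare) and len(tree.ops) == 2

# Evaluating the chain forces bool() on the first comparison result.
try:
    col1 == colA | col2 == colB
    unparenthesized_error = None
except ValueError as exc:
    unparenthesized_error = str(exc)

# With parentheses, each comparison is built first and then combined with |.
parenthesized = (col1 == colA) | (col2 == colB)
print(parenthesized)
```

The same reasoning applies to & for "and": each comparison needs its own parentheses before the conditions are combined.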
