Need to filter the Spark dataframe with filter values from iterating over multiple columns
[Posted]: 2021-05-13 09:27:58
[Question]: I have the following dataset in a Spark DataFrame and need to filter it on these conditions:
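Equal to: ID: (6, 7, 8, 9, 13, 15, 16, 18)
Not equal to: STATE: (Illinois, Oklahoma), CITY: (Orange, Boca_Raton)
Rather than hard-coding the values, I need to iterate over these columns, obtain the filter values as key-value pairs, and use them to filter the DataFrame into a result df.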
id | NAME | CITY | STATE |
---|---|---|---|
1 | Roseann | Richmond | Virginia |
3 | Jameson | Fort_Lauderdale | Florida |
4 | Marline | Washington | District_of_Columbia |
5 | Ivory | Macon | Georgia |
6 | Toby | San_Diego | California |
7 | Isacco | Honolulu | Illinois |
8 | Sallee | Orange | California |
9 | Lannie | Peoria | Oklahoma |
10 | Bradley | Tulsa | Oklahoma |
11 | Teodora | Pittsburgh | Pennsylvania |
12 | Benedikta | Tampa | Florida |
13 | Zelma | Newport_News | California |
14 | Carilyn | Flint | Michigan |
15 | Joey | Boca_Raton | California |
16 | Pattie | Boston | Massachusetts |
17 | Dag | Bismarck | North_Dakota |
18 | Glynn | Decatur | Oklahoma |
19 | Hilton | Phoenix | Arizona |
20 | Barbette | New_Orleans | Louisiana |
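[Answer 1]: You can use the isin function with lists of values, and negate it for the columns whose values must be excluded. Something like this: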
import org.apache.spark.sql.functions.col

// Values to keep (id) and values to exclude (STATE, CITY)
val listIDs = Seq(6, 7, 8, 9, 13, 15, 16, 18)
val listStates = Seq("Illinois", "Oklahoma")
val listCities = Seq("Orange", "Boca_Raton")
// Combine the per-column conditions with a logical AND
val conditionExpr = Seq(
  col("id").isin(listIDs: _*),
  !col("STATE").isin(listStates: _*),
  !col("CITY").isin(listCities: _*)
).reduce(_ and _)
val df1 = df.filter(conditionExpr)
df1.show()
//+---+------+------------+-------------+
//| id|  NAME|        CITY|        STATE|
//+---+------+------------+-------------+
//|  6|  Toby|   San_Diego|   California|
//| 13| Zelma|Newport_News|   California|
//| 16|Pattie|      Boston|Massachusetts|
//+---+------+------------+-------------+
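The question asks for the filter values to come from key-value pairs rather than from hard-coded variables. One possible way to get there, sketched under assumptions: keep two maps from column name to values (one for "equal to", one for "not equal to") and fold them into a single condition. The names KeyValueFilter, includeFilters, excludeFilters and buildCondition are hypothetical and are not part of the original answer; the sample rows are a subset of the table in the question.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object KeyValueFilter {
  // Hypothetical filter specification: column name -> values to keep.
  val includeFilters: Map[String, Seq[Any]] = Map(
    "id" -> Seq(6, 7, 8, 9, 13, 15, 16, 18)
  )

  // Hypothetical filter specification: column name -> values to exclude.
  val excludeFilters: Map[String, Seq[Any]] = Map(
    "STATE" -> Seq("Illinois", "Oklahoma"),
    "CITY"  -> Seq("Orange", "Boca_Raton")
  )

  // Turn every key-value pair into an isin (or negated isin) condition
  // and AND them all together. lit(true) is the neutral starting value,
  // so empty maps yield a condition that keeps every row.
  def buildCondition(include: Map[String, Seq[Any]],
                     exclude: Map[String, Seq[Any]]): Column = {
    val includeConds = include.map { case (name, values) => col(name).isin(values: _*) }
    val excludeConds = exclude.map { case (name, values) => !col(name).isin(values: _*) }
    (includeConds ++ excludeConds).foldLeft(lit(true))(_ && _)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("key-value-filter-sketch")
      .getOrCreate()
    import spark.implicits._

    // A few rows copied from the question's table.
    val df = Seq(
      (6, "Toby", "San_Diego", "California"),
      (7, "Isacco", "Honolulu", "Illinois"),
      (8, "Sallee", "Orange", "California"),
      (10, "Bradley", "Tulsa", "Oklahoma"),
      (13, "Zelma", "Newport_News", "California")
    ).toDF("id", "NAME", "CITY", "STATE")

    df.filter(buildCondition(includeFilters, excludeFilters)).show()
    spark.stop()
  }
}
```

On these sample rows only ids 6 and 13 should remain. Note that a negated isin evaluates to null for rows where the column itself is null, so such rows are dropped by the filter; if they should be kept, the exclude condition would need an explicit isNull check.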