Validate data in 2 columns against a master table's single column (spark.sql)

Posted: 2019-08-22 09:05:32

Question:

I have 2 tables: a master table of ZIP codes, and a transaction table that contains a current address and a permanent address. Both address columns carry a ZIP code, and I need to validate these 2 ZIP codes against the master table.

Master Table:
+--------+--------------+-----+
|zip_code|territory_name|state|
+--------+--------------+-----+
|   81A02|  TERR NAME 02|   NY|
|   81A04|  TERR NAME 04|   FL|
|   81A05|  TERR NAME 05|   NJ|
|   81A06|  TERR NAME 06|   CA|
|   81A07|  TERR NAME 06|   CA|
+--------+--------------+-----+

Transaction table:
+-----------+-----------+-----+
|Address1_zc|Address2_zc|state|
+-----------+-----------+-----+
|      81A02|      81A05|   NY|
|      81A04|      81A06|   FL|
|      81A05|      90005|   NJ|
|      81A06|      90006|   CA|
|      41A06|      81A06|   CA|
+-----------+-----------+-----+

The result set should contain only the rows where both ADDRESS1_ZC and ADDRESS2_ZC hold valid ZIP codes:

 +-----------+-----------+-----+ 
 |Address1_zc|Address2_zc|state| 
 +-----------+-----------+-----+ 
 | 81A02     | 81A05     | NY  | 
 | 81A04     | 81A06     | FL  | 
 +-----------+-----------+-----+

Here are the dataframes I am providing for testing:

# Master table of valid zip codes, registered as the temp view "df1_mast"
df1 = sqlContext.createDataFrame([("81A01","TERR NAME 01","NJ"),("81A01","TERR NAME 01","CA"),("81A02","TERR NAME 02","NY"),("81A03","TERR NAME 03","NY"), ("81A03","TERR NAME 03","CA"), ("81A04","TERR NAME 04","FL"), ("81A05","TERR NAME 05","NJ"), ("81A06","TERR NAME 06","CA"), ("81A06","TERR NAME 06","CA")], ["zip_code","territory_name","state"])
df1.createOrReplaceTempView("df1_mast")

# Transaction table with the two address zip columns, registered as "df1_tran"
df2 = sqlContext.createDataFrame([("81A02","81A05"),("81A04","81A06"),("81A05","90005"),("81A06","90006"),("41A06","81A06")], ["Address1_zc","Address2_zc"])
df2.createOrReplaceTempView("df1_tran")
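
A quick sanity check that the two temp views are queryable through SQL, using the same sqlContext as above:

sqlContext.sql("SELECT * FROM df1_mast").show()
sqlContext.sql("SELECT * FROM df1_tran").show()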

I tried the following SQL, but it does not give the desired result.

select a.* from df1_tran a join df1_mast b on a.zip_code = b.Address_zc1 or a.zip_code = b.Address_zc2 where a.zip_code is null

Please help.

Comments:

I don't understand where your 81A05 / 81A06 in Address2_zc of the third dataframe come from. Does it work if you replace or with and?

Pierre Gourseaud, I have updated the dataset, could you please take a look. Thanks.

No, it did not work even after replacing OR with AND.

Answer 1:

PySpark way:

df1 = sqlContext.createDataFrame([("81A01","TERR NAME 01","NJ"),("81A01","TERR NAME 01","CA"),("81A02","TERR NAME 02","NY"),("81A03","TERR NAME 03","NY"), ("81A03","TERR NAME 03","CA"), ("81A04","TERR NAME 04","FL"), ("81A05","TERR NAME 05","NJ"), ("81A06","TERR NAME 06","CA"), ("81A06","TERR NAME 06","CA")], ["zip_code","territory_name","state"])

df2 = sqlContext.createDataFrame([("81A02","81A05"),("81A04","81A06"),("81A05","90005"),("81A05","90006"),("41A06","81A06")], ["Address1_zc","Address2_zc"])

# Validate Address1_zc: the inner join keeps only rows whose first zip exists in the master
df3 = df2.join(df1, df2['Address1_zc'] == df1['zip_code'], 'inner')
df4 = df3.withColumnRenamed('state', 'state1').drop(*(df1.columns))

# Validate Address2_zc: join again on the second zip, keeping the master state as state2
df5 = df4.join(df1, df2['Address2_zc'] == df1['zip_code'], 'inner')
df6 = df5.withColumnRenamed('state', 'state2').drop(*(df1.columns))
df6.show()

 +-----------+-----------+------+------+
 |Address1_zc|Address2_zc|state1|state2|
 +-----------+-----------+------+------+
 | 81A02     | 81A05     |NY    |NJ    |
 | 81A04     | 81A06     |FL    |CA    |
 +-----------+-----------+------+------+
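
If only the validated transaction rows are needed, without carrying state1/state2 over from the master, a left-semi join is one alternative. A minimal sketch, assuming the df1 and df2 dataframes defined above:

from pyspark.sql import functions as F

# Distinct valid zip codes from the master table, aliased so the two joins do not collide
zips1 = df1.select(F.col("zip_code").alias("z1")).distinct()
zips2 = df1.select(F.col("zip_code").alias("z2")).distinct()

# left_semi keeps a transaction row only if the join key matches, and adds no master columns
valid = (df2
         .join(zips1, F.col("Address1_zc") == F.col("z1"), "left_semi")
         .join(zips2, F.col("Address2_zc") == F.col("z2"), "left_semi"))
valid.show()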

SQL way:

SELECT t.*,
       a.state AS state1, 
       b.state AS state2
FROM df2 AS t
       JOIN df1 AS a ON t.Address1_zc = a.zip_code      
       JOIN df1 AS b ON t.Address2_zc = b.zip_code
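
For completeness, a minimal sketch of running this query through spark.sql against the temp views registered in the question (df1_mast and df1_tran); it assumes an active SparkSession named spark, which the original snippets do not show:

result = spark.sql("""
    SELECT t.*,
           a.state AS state1,
           b.state AS state2
    FROM df1_tran AS t
         JOIN df1_mast AS a ON t.Address1_zc = a.zip_code
         JOIN df1_mast AS b ON t.Address2_zc = b.zip_code
""")
result.show()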

Comments:

Thanks, but could you provide this in spark.sql JOIN format? I cannot use the Python way in my project.

@Yuva I have added the SQL syntax.
