How to filter Spark SQL by a nested array field (an array within an array)?
My Spark DataFrame schema:
|-- goods: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- brand_id: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- product_id: string (nullable = true)
I am able to filter the DataFrame by product_id:
select * from goodsInfo where array_contains(goods.product_id, 'f31ee3f8-9ba2-49cb-86e2-ceb44e34efd9')
But I cannot filter by brand_id, which is an array inside an array. This attempt fails:
select * from goodsInfo where array_contains(goods.brand_id, '45c060b9-3645-49ad-86eb-65f3cd4e9081')
Error:
function array_contains should have been array followed by a value with same element type, but it's [array<array<string>>, string].; line 1 pos 45;
Can anyone help? Thanks in advance.
I created a sample DataFrame, exploded the array column, and then filtered the records. Please find the sample code below.
import org.apache.spark.sql.functions._
val jsonData = """{
  "id": "0001",
  "batters": {
    "batter": [1001, 1002, 1003]
  }
}"""
// Create DataFrame
val df = spark.read.json(Seq(jsonData).toDS).toDF
// Explode the array column and then filter.
val newdf = df.select($"id", explode($"batters.batter").as("batter")).filter("batter = 1001")
newdf.show()
// Output
+----+------+
| id|batter|
+----+------+
|0001| 1001|
+----+------+
You can also refer to my blog post, "Working with JSON in Apache Spark".
I hope this helps.
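The toy example above uses a different schema than the question. As a sketch only (untested, assuming the asker's DataFrame `df` has the goods schema shown in the question), the same explode-then-filter idea would look like this:

```scala
import org.apache.spark.sql.functions._

// Sketch: explode goods so each struct becomes a row named g,
// then filter on the inner brand_id array with array_contains.
val matched = df
  .select($"goods", explode($"goods").as("g"))
  .filter(array_contains($"g.brand_id", "45c060b9-3645-49ad-86eb-65f3cd4e9081"))
  .select($"goods")
  .distinct
```

After the explode, `g.brand_id` is a plain array<string>, so array_contains can match a single brand id directly.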
function array_contains should have been array followed by a value with same element type, but it's [array<array<string>>, string].; line 1 pos 45;
This is because brand_id has type array<array<string>>, while the value you are passing has type string. You must wrap the value in an array, i.e.:
array_contains(goods.brand_id, array('45c060b9-3645-49ad-86eb-65f3cd4e9081'))
This only matches when the array in the where clause is exactly equal to one of the inner arrays in brand_id. Check the code below.
scala> df.printSchema
root
|-- goods: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- brand_id: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- product_id: string (nullable = true)
scala> spark.sql("select * from goodsInfo").show(false)
+----------------------+
|goods |
+----------------------+
|[[[B100, B101], P100]]|
|[[[B200, B200], P200]]|
|[[[B300], P300]] |
+----------------------+
scala> spark.sql("select * from goodsInfo where array_contains(goods.brand_id, array('B300'))").show(false) // Exactly one element
+----------------+
|goods |
+----------------+
|[[[B300], P300]]|
+----------------+
scala> spark.sql("select * from goodsInfo where array_contains(goods.brand_id, array('B100'))").show(false) // Passing a single element of the inner array does not match.
+-----+
|goods|
+-----+
+-----+
scala> spark.sql("select * from goodsInfo where array_contains(goods.brand_id, array('B100','B101'))").show(false) // Pass the complete inner array and it matches.
+----------------------+
|goods |
+----------------------+
|[[[B100, B101], P100]]|
+----------------------+
Instead of array_contains, try using LATERAL VIEW explode to explode the nested array values. Check the query below.
SELECT
goods
,brand_id
FROM goodsInfo
LATERAL VIEW explode(goods.brand_id) brands as brands
LATERAL VIEW explode(brands) brand_id as brand_id
WHERE brand_id = '45c060b9-3645-49ad-86eb-65f3cd4e9081';
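One alternative worth noting (not from the original answers, and assuming you are on Spark 2.4 or later, where the built-in flatten function is available): flatten collapses array<array<string>> into array<string>, so array_contains can then match a single brand id without exploding rows or requiring an exact inner-array match:

```scala
// Sketch, assuming Spark >= 2.4 where flatten() exists.
// flatten(goods.brand_id) turns array<array<string>> into array<string>.
spark.sql("""
  SELECT * FROM goodsInfo
  WHERE array_contains(flatten(goods.brand_id), '45c060b9-3645-49ad-86eb-65f3cd4e9081')
""").show(false)
```

Unlike the array-wrapping approach above, this keeps one row per record and avoids the exact-match caveat.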