Loading geoJSON in pyspark with Schema validation
Posted: 2018-06-12 08:59:44

I am trying to create a schema to validate the GeoJSON file being loaded:
from pyspark.sql.types import (
    ArrayType, DoubleType, MapType, StringType, StructField, StructType)

validSchema = StructType([
    StructField("type", StringType()),
    StructField("geometry", StructType([
        StructField("coordinates", ArrayType(DoubleType())),  # POINT
        StructField("coordinates", ArrayType(ArrayType(ArrayType(DoubleType())))),  # POLYGON
        StructField("coordinates", ArrayType(ArrayType(DoubleType()))),  # LINESTRING
        StructField("type", StringType(), False)
    ]), False),
    StructField("properties", MapType(StringType(), StringType()))
])

df = spark.read.option("multiline", "true").json(src_data, mode="PERMISSIVE", schema=validSchema)
The problem is that I have three "coordinates" fields to cover the valid GeoJSON geometry types. However, only the last rule takes effect; I assume it overrides the first two based on order.

Is there any way to specify the schema as "one of the coordinate schemas must match"?

Right now the only approach I can see is to create three schemas and do three separate imports, which means scanning all the data three times (I have 5TB of data, so that seems crazy).
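To make the shape mismatch concrete, here is a small plain-Python sketch (the `nesting_depth` helper is hypothetical, not part of Spark or the question) showing that the three geometry types nest their coordinates to different depths, which is why a single ArrayType definition cannot describe all of them:

```python
# Hypothetical helper: count how deeply a "coordinates" value is nested.
# Point -> 1 (array<double>), LineString -> 2 (array<array<double>>),
# Polygon -> 3 (array<array<array<double>>>).
def nesting_depth(value):
    depth = 0
    while isinstance(value, list):
        depth += 1
        value = value[0] if value else None
    return depth

print(nesting_depth([-0.164, 52.060]))              # 1 -> Point
print(nesting_depth([[-0.196, 52.112],
                     [-0.126, 52.077]]))            # 2 -> LineString
print(nesting_depth([[[-0.144, 52.019],
                      [-0.127, 52.000]]]))          # 3 -> Polygon
```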
Example geoJSON data:
{"type": "Feature",
 "properties": {},
 "geometry": {
   "type": "Polygon",
   "coordinates": [[[-0.144195556640625, 52.019120643633386],
                    [-0.127716064453125, 52.00052411347729],
                    [-0.10848999023437499, 52.01193653675363],
                    [-0.12359619140625, 52.02883848153626],
                    [-0.144195556640625, 52.019120643633386]]]
 }
},
{"type": "Feature",
 "properties": {},
 "geometry": {
   "type": "LineString",
   "coordinates": [[-0.196380615234375, 52.11283076186275],
                   [-0.1263427734375, 52.07739600418385]]
 }
},
{"type": "Feature",
 "properties": {},
 "geometry": {
   "type": "Point",
   "coordinates": [-0.1641082763671875, 52.06051241654061]
 }
}
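As a side note, a basic structural check on features like the ones above can be done in plain Python before tackling the coordinate-shape problem (the `is_valid_feature` helper below is a hypothetical sketch, not part of the question's pipeline):

```python
# Hypothetical minimal validator: checks the required GeoJSON Feature keys,
# but not the coordinate shape (which is the hard part handled in Spark).
GEOMETRY_TYPES = {"Point", "LineString", "Polygon"}

def is_valid_feature(feature):
    if feature.get("type") != "Feature":
        return False
    geom = feature.get("geometry")
    if not isinstance(geom, dict):
        return False
    return geom.get("type") in GEOMETRY_TYPES and "coordinates" in geom

sample = {"type": "Feature", "properties": {},
          "geometry": {"type": "Point",
                       "coordinates": [-0.1641082763671875, 52.06051241654061]}}
print(is_valid_feature(sample))           # True
print(is_valid_feature({"type": "Feature"}))  # False
```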
Thanks
Answer 1:

> Is there any way to specify the schema as "one of the coordinate schemas must match"?

You could try UserDefinedTypes, but these are no longer supported. Either way, all values in a Column must have the same shape, so you cannot have array<array<array<double>>>, array<array<double>>, and array<double> in the same field.
Instead, you can skip parsing the coordinates entirely:
validSchema = StructType([
    StructField("type", StringType()),
    StructField("geometry", StructType([
        StructField("coordinates", StringType()),
        StructField("type", StringType(), False)
    ]), False),
    StructField("properties", MapType(StringType(), StringType()))
])
and then parse it into three separate fields with a udf:
from pyspark.sql.functions import udf
import json

@udf("struct<type: string, coordinates: struct<polygon: array<array<struct<lon: double, lat: double>>>, line: array<struct<lon: double, lat: double>>, point: struct<lon: double, lat: double>>>")
def parse(row):
    try:
        struct = json.loads(row["coordinates"])
        t = row["type"]
    except (TypeError, json.decoder.JSONDecodeError):
        return None  # malformed row: return a null struct instead of failing
    if t == "Polygon":
        return t, (struct, None, None)
    elif t == "LineString":
        return t, (None, struct, None)
    elif t == "Point":
        return t, (None, None, struct)
sdf.select(parse("geometry")).show(truncate=False)
# +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
# |parse(geometry) |
# +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
# |[Polygon, [[[[-0.144195556640625, 52.019120643633386], [-0.127716064453125, 52.00052411347729], [-0.10848999023437499, 52.01193653675363], [-0.12359619140625, 52.02883848153626], [-0.144195556640625, 52.019120643633386]]],,]]|
# |[LineString, [, [[-0.196380615234375, 52.11283076186275], [-0.1263427734375, 52.07739600418385]],]] |
# |[Point, [,, [-0.1641082763671875, 52.06051241654061]]] |
# +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
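The dispatch inside the udf can be mirrored in plain Python (no Spark needed) to see how each geometry type lands in exactly one slot of the output struct; this is a hedged sketch with a hypothetical function name, not the answer's exact code:

```python
import json

# Plain-Python mirror of the udf's logic: parse the stringified
# coordinates and route them into the (polygon, line, point) slots.
def parse_geometry(row):
    try:
        coords = json.loads(row["coordinates"])
        t = row["type"]
    except (TypeError, KeyError, json.JSONDecodeError):
        return None
    if t == "Polygon":
        return t, (coords, None, None)
    if t == "LineString":
        return t, (None, coords, None)
    if t == "Point":
        return t, (None, None, coords)
    return None

print(parse_geometry({"type": "Point",
                      "coordinates": "[-0.1641082763671875, 52.06051241654061]"}))
# ('Point', (None, None, [-0.1641082763671875, 52.06051241654061]))
```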
Comments:

Thanks for the reply. I think you hit the nail on the head by saying it is not possible to have multiple schemas for one field. I will try the approach you describe here and report back.