Reading and Writing XML Files with Spark, and Things to Watch Out For
Posted 浪尖聊大数据
First, add the spark-xml dependency to the Maven build:
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-xml_2.11</artifactId>
    <version>0.9.0</version>
</dependency>
A sample XML file with flat book records:
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>
An in-depth look at creating applications
with XML.This manual describes Oracle XML DB, and how you can use it to store, generate, manipulate, manage,
and query XML data in the database.
After introducing you to the heart of Oracle XML DB, namely the XMLType framework and Oracle XML DB repository,
the manual provides a brief introduction to design criteria to consider when planning your Oracle XML DB
application. It provides examples of how and where you can use Oracle XML DB.
The manual then describes ways you can store and retrieve XML data using Oracle XML DB, APIs for manipulating
XMLType data, and ways you can view, generate, transform, and search on existing XML data. The remainder of
the manual discusses how to use Oracle XML DB repository, including versioning and security,
how to access and manipulate repository resources using protocols, SQL, PL/SQL, or Java, and how to manage
your Oracle XML DB application using Oracle Enterprise Manager. It also introduces you to XML messaging and
Oracle Streams Advanced Queuing XMLType support.
</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
</book>
<book id="bk103">
<author>Corets, Eva</author>
<title>Maeve Ascendant</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-11-17</publish_date>
<description>After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society.</description>
</book>
<book id="bk104">
<author>Corets, Eva</author>
<title>Oberon's Legacy</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-03-10</publish_date>
<description>In post-apocalypse England, the mysterious
agent known only as Oberon helps to create a new life
for the inhabitants of London. Sequel to Maeve
Ascendant.</description>
</book>
<book id="bk105">
<author>Corets, Eva</author>
<title>The Sundered Grail</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-09-10</publish_date>
<description>The two daughters of Maeve, half-sisters,
battle one another for control of England. Sequel to
Oberon's Legacy.</description>
</book>
<book id="bk106">
<author>Randall, Cynthia</author>
<title>Lover Birds</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-09-02</publish_date>
<description>When Carla meets Paul at an ornithology
conference, tempers fly as feathers get ruffled.</description>
</book>
<book id="bk107">
<author>Thurman, Paula</author>
<title>Splish Splash</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-11-02</publish_date>
<description>A deep sea diver finds true love twenty
thousand leagues beneath the sea.</description>
</book>
<book id="bk108">
<author>Knorr, Stefan</author>
<title>Creepy Crawlies</title>
<genre>Horror</genre>
<price>4.95</price>
<publish_date>2000-12-06</publish_date>
<description>An anthology of horror stories about roaches,
centipedes, scorpions and other insects.</description>
</book>
<book id="bk109">
<author>Kress, Peter</author>
<title>Paradox Lost</title>
<genre>Science Fiction</genre>
<price>6.95</price>
<publish_date>2000-11-02</publish_date>
<description>After an inadvertant trip through a Heisenberg
Uncertainty Device, James Salway discovers the problems
of being quantum.</description>
</book>
<book id="bk110">
<author>O'Brien, Tim</author>
<title>Microsoft .NET: The Programming Bible</title>
<genre>Computer</genre>
<price>36.95</price>
<publish_date>2000-12-09</publish_date>
<description>Microsoft's .NET initiative is explored in
detail in this deep programmer's reference.</description>
</book>
<book id="bk111">
<author>O'Brien, Tim</author>
<title>MSXML3: A Comprehensive Guide</title>
<genre>Computer</genre>
<price>36.95</price>
<publish_date>2000-12-01</publish_date>
<description>The Microsoft MSXML3 parser is covered in
detail, with attention to XML DOM interfaces, XSLT processing,
SAX and more.</description>
</book>
<book id="bk112">
<author>Galos, Mike</author>
<title>Visual Studio 7: A Comprehensive Guide</title>
<genre>Computer</genre>
<price>49.95</price>
<publish_date>2001-04-16</publish_date>
<description>Microsoft Visual Studio 7 is explored in depth,
looking at how Visual Basic, Visual C++, C#, and ASP+ are
integrated into a comprehensive development
environment.</description>
</book>
</catalog>
Reading a nested XML file without specifying a schema:
package com.vivo.study.xml

import org.apache.spark.sql.{Row, SparkSession}

object ReadBooksXMLWithNestedArray {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]")
      .appName("SparkByExample")
      .getOrCreate()

    // No schema is supplied, so spark-xml infers one from the data.
    val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")   // each <book> element becomes one row
      .load("data/books_complex.xml")

    df.printSchema()
    df.show()

    df.foreach((row: Row) => {
      // Explicit type parameters keep getAs from being inferred as Nothing.
      println(row.getAs[String]("author") + "," + row.getAs[String]("_id"))

      // Look nested structs up by name; positions depend on the inferred column order.
      val otherInfo = row.getAs[Row]("otherInfo")
      println(otherInfo.getAs[String]("country"))
      println(otherInfo.getClass)

      // stores.store is an array of structs; a missing <location> prints as null.
      val stores = row.getAs[Row]("stores").getSeq[Row](0)
      for (store <- stores) {
        println(store.getAs[String]("name") + "," + store.getAs[String]("location"))
      }
    })
  }
}
root
 |-- _id: string (nullable = true)
 |-- author: string (nullable = true)
 |-- description: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- otherInfo: struct (nullable = true)
 |    |-- address: struct (nullable = true)
 |    |    |-- addressline1: string (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- state: string (nullable = true)
 |    |-- country: string (nullable = true)
 |    |-- language: string (nullable = true)
 |    |-- pagesCount: long (nullable = true)
 |-- price: double (nullable = true)
 |-- publish_date: string (nullable = true)
 |-- stores: struct (nullable = true)
 |    |-- store: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- location: string (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |-- title: string (nullable = true)
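A few things to note:
- No schema was specified, yet printSchema() still printed one. Spark SQL (through spark-xml) inferred the schema from the XML file itself.
- Deeply nested arrays can also be read with an explicitly specified schema. An example of that is given further down in this post.
- rowTag is the XML element that is treated as one row. There is also rootTag, the root element of the XML document, which is used when writing XML.
- _id comes from the id attribute of the <book> element, not from a child element. Attributes get an underscore prefix (_) to tell them apart from elements. If you dislike the underscore, change it with the attributePrefix option. If you do not care about attributes at all, turn them off with the excludeAttribute option.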
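As a rough sketch of the two attribute options just mentioned, the code below reads the same file twice: once with a different attribute prefix and once with attributes excluded. It assumes the spark-xml dependency and the data/books_complex.xml file used in this post; the object name, the helper names, and the attr_ prefix are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object AttributeOptionsSketch {

  // Reads <book> rows and names attribute columns with a custom prefix.
  def readWithPrefix(spark: SparkSession, path: String, prefix: String): DataFrame =
    spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")
      .option("attributePrefix", prefix)    // the default prefix is "_"
      .load(path)

  // Reads <book> rows and leaves XML attributes out of the result.
  def readWithoutAttributes(spark: SparkSession, path: String): DataFrame =
    spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")
      .option("excludeAttribute", "true")   // the default is "false"
      .load(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]")
      .appName("AttributeOptionsSketch")
      .getOrCreate()

    // The id attribute should now show up as attr_id instead of _id.
    readWithPrefix(spark, "data/books_complex.xml", "attr_").printSchema()

    // No column for the id attribute should appear here.
    readWithoutAttributes(spark, "data/books_complex.xml").printSchema()

    spark.stop()
  }
}
```

Comparing the two printed schemas with the inferred schema shown earlier is a quick way to confirm what each option changes.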
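Writing a DataFrame back out as XML:
// df2 is any DataFrame you want to write out (its definition is not shown here)
df2.write
  .format("com.databricks.spark.xml")
  .option("rootTag", "books")   // root element of the output document
  .option("rowTag", "book")     // element written for each row
  .save("src/main/resources/books_new.xml")   // Spark creates a directory with this name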
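The write snippet above relies on a DataFrame named df2 that is not defined in the post. A minimal self-contained sketch of the same write follows, under the assumption that a small in-memory DataFrame can stand in for df2. The Book case class, the object name, and the output path are hypothetical.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Hypothetical record type standing in for the real book data.
case class Book(_id: String, author: String, title: String, price: Double)

object WriteBooksXmlSketch {

  def writeBooks(spark: SparkSession, outputDir: String): Unit = {
    import spark.implicits._

    // Two rows taken from the sample catalog, used as a stand-in for df2.
    val df2 = Seq(
      Book("bk101", "Gambardella, Matthew", "XML Developer's Guide", 44.95),
      Book("bk102", "Ralls, Kim", "Midnight Rain", 5.95)
    ).toDF()

    df2.write
      .format("com.databricks.spark.xml")
      .option("rootTag", "books")   // root element of the output document
      .option("rowTag", "book")     // element written for each row
      .mode(SaveMode.Overwrite)     // replace output left by an earlier run
      .save(outputDir)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]")
      .appName("WriteBooksXmlSketch")
      .getOrCreate()

    writeBooks(spark, "target/books_new.xml")
    spark.stop()
  }
}
```

The path passed to save ends up as a directory containing part files rather than a single XML file. Because the _id column starts with the default attribute prefix, it should be written back as an id attribute on each book element.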
Reading the same file with an explicitly specified schema:
package com.vivo.study.xml

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

object ReadBooksXMLWithNestedArrayStruct {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]")
      .appName("langjian")
      .getOrCreate()

    // Hand-written schema; the field order here determines the column order.
    val customSchema = StructType(Array(
      StructField("_id", StringType, nullable = true),
      StructField("author", StringType, nullable = true),
      StructField("description", StringType, nullable = true),
      StructField("genre", StringType, nullable = true),
      StructField("price", DoubleType, nullable = true),
      StructField("publish_date", StringType, nullable = true),
      StructField("title", StringType, nullable = true),
      StructField("otherInfo", StructType(Array(
        StructField("pagesCount", StringType, nullable = true),
        StructField("language", StringType, nullable = true),
        StructField("country", StringType, nullable = true),
        StructField("address", StructType(Array(
          StructField("addressline1", StringType, nullable = true),
          StructField("city", StringType, nullable = true),
          StructField("state", StringType, nullable = true)
        )))
      ))),
      // <stores> wraps a repeated <store> element, hence struct -> array -> struct.
      StructField("stores", StructType(Array(
        StructField("store", ArrayType(
          StructType(Array(
            StructField("location", StringType, nullable = true),
            StructField("name", StringType, nullable = true)
          ))
        ))
      )))
    ))

    val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")
      .schema(customSchema)   // skip inference and use the schema above
      .load("data/books_complex.xml")

    df.printSchema()
    df.show()

    df.foreach((row: Row) => {
      // Explicit type parameters keep getAs from being inferred as Nothing.
      println(row.getAs[String]("author") + "," + row.getAs[String]("_id"))

      // Access nested structs by name so the code does not depend on column positions.
      val otherInfo = row.getAs[Row]("otherInfo")
      println(otherInfo.getAs[String]("country"))
      println(otherInfo.getClass)

      val stores = row.getAs[Row]("stores").getSeq[Row](0)
      for (store <- stores) {
        println(store.getAs[String]("name") + "," + store.getAs[String]("location"))
      }
    })
  }
}
root
 |-- _id: string (nullable = true)
 |-- author: string (nullable = true)
 |-- description: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- price: double (nullable = true)
 |-- publish_date: string (nullable = true)
 |-- title: string (nullable = true)
 |-- otherInfo: struct (nullable = true)
 |    |-- pagesCount: string (nullable = true)
 |    |-- language: string (nullable = true)
 |    |-- country: string (nullable = true)
 |    |-- address: struct (nullable = true)
 |    |    |-- addressline1: string (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- state: string (nullable = true)
 |-- stores: struct (nullable = true)
 |    |-- store: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- location: string (nullable = true)
 |    |    |    |-- name: string (nullable = true)
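Walking each row by hand works, but the nested stores.store array can also be flattened with the DataFrame API. The following is a minimal sketch of that alternative, not the approach used in this post. It assumes the schema printed above and the same input file; the object name, the helper name, and the output column names are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object FlattenStoresSketch {

  // Produces one output row per (book, store) pair.
  def flattenStores(books: DataFrame): DataFrame =
    books
      .select(
        col("_id"),
        col("author"),
        col("otherInfo.country").as("country"),    // nested struct field by dotted path
        explode(col("stores.store")).as("store"))  // one row per array element
      .select(
        col("_id"),
        col("author"),
        col("country"),
        col("store.name").as("store_name"),
        col("store.location").as("store_location"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]")
      .appName("FlattenStoresSketch")
      .getOrCreate()

    val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")
      .load("data/books_complex.xml")

    flattenStores(df).show(false)
    spark.stop()
  }
}
```

Note that explode drops books whose store array is null or empty. If those rows should be kept, explode_outer from the same functions object may be the better fit.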
The input file read by both programs, data/books_complex.xml:
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>
An in-depth look at creating applications
with XML.This manual describes Oracle XML DB, and how you can use it to store, generate, manipulate, manage,
and query XML data in the database.
After introducing you to the heart of Oracle XML DB, namely the XMLType framework and Oracle XML DB repository,
the manual provides a brief introduction to design criteria to consider when planning your Oracle XML DB
application. It provides examples of how and where you can use Oracle XML DB.
The manual then describes ways you can store and retrieve XML data using Oracle XML DB, APIs for manipulating
XMLType data, and ways you can view, generate, transform, and search on existing XML data. The remainder of
the manual discusses how to use Oracle XML DB repository, including versioning and security,
how to access and manipulate repository resources using protocols, SQL, PL/SQL, or Java, and how to manage
your Oracle XML DB application using Oracle Enterprise Manager. It also introduces you to XML messaging and
Oracle Streams Advanced Queuing XMLType support.
</description>
<otherInfo>
<pagesCount>100</pagesCount>
<language>english</language>
<country>India</country>
<address>
<addressline1>3417 south plaza dr</addressline1>
<city>Costa mesa</city>
<state>CA</state>
</address>
</otherInfo>
<stores>
<store>
<name>Costco</name>
<location>usa</location>
</store>
<store>
<name>Target</name>
<location>UK</location>
</store>
</stores>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
<otherInfo>
<pagesCount>100</pagesCount>
<language>english</language>
<country>India</country>
<address>
<addressline1>3417 south plaza dr</addressline1>
<city>Costa mesa</city>
<state>CA</state>
</address>
</otherInfo>
<stores>
<store>
<name>Costco</name>
</store>
<store>
<name>Target</name>
</store>
<store>
<name>Walmart</name>
</store>
</stores>
</book>
<book id="bk103">
<author>Corets, Eva</author>
<title>Maeve Ascendant</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-11-17</publish_date>
<description>After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society.</description>
<otherInfo>
<pagesCount>100</pagesCount>
<language>english</language>
<country>India</country>
<address>
<addressline1>3417 south plaza dr</addressline1>
<city>Costa mesa</city>
<state>CA</state>
</address>
</otherInfo>
<stores>
<store>
<name>Costco</name>
</store>
</stores>
</book>
<book id="bk104">
<author>Corets, Eva</author>
<title>Oberon's Legacy</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-03-10</publish_date>
<description>In post-apocalypse England, the mysterious
agent known only as Oberon helps to create a new life
for the inhabitants of London. Sequel to Maeve
Ascendant.</description>
<otherInfo>
<pagesCount>100</pagesCount>
<language>english</language>
<country>India</country>
<address>
<addressline1>3417 south plaza dr</addressline1>
<city>Costa mesa</city>
<state>CA</state>
</address>
</otherInfo>
<stores>
<store>
<name>Costco</name>
</store>
</stores>
</book>
<book id="bk105">
<author>Corets, Eva</author>
<title>The Sundered Grail</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-09-10</publish_date>
<description>The two daughters of Maeve, half-sisters,
battle one another for control of England. Sequel to
Oberon's Legacy.</description>
<otherInfo>
<pagesCount>100</pagesCount>
<language>english</language>
<country>India</country>
<address>
<addressline1>3417 south plaza dr</addressline1>
<city>Costa mesa</city>
<state>CA</state>
</address>
</otherInfo>
<stores>
<store>
<name>Costco</name>
</store>
</stores>
</book>
<book id="bk106">
<author>Randall, Cynthia</author>
<title>Lover Birds</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-09-02</publish_date>
<description>When Carla meets Paul at an ornithology
conference, tempers fly as feathers get ruffled.</description>
<otherInfo>
<pagesCount>100</pagesCount>
<language>english</language>
<country>India</country>
<address>
<addressline1>3417 south plaza dr</addressline1>
<city>Costa mesa</city>
<state>CA</state>
</address>
</otherInfo>
<stores>
<store>
<name>Costco</name>
</store>
</stores>
</book>
<book id="bk107">
<author>Thurman, Paula</author>
<title>Splish Splash</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-11-02</publish_date>
<description>A deep sea diver finds true love twenty
thousand leagues beneath the sea.</description>
<otherInfo>
<pagesCount>100</pagesCount>
<language>english</language>
<country>India</country>
<address>
<addressline1>3417 south plaza dr</addressline1>
<city>Costa mesa</city>
<state>CA</state>
</address>
</otherInfo>
<stores>
<store>
<name>Costco</name>
</store>
</stores>
</book>
<book id="bk108">
<author>Knorr, Stefan</author>
<title>Creepy Crawlies</title>
<genre>Horror</genre>
<price>4.95</price>
<publish_date>2000-12-06</publish_date>
<description>An anthology of horror stories about roaches,
centipedes, scorpions and other insects.</description>
<otherInfo>
<pagesCount>100</pagesCount>
<language>english</language>
<country>India</country>
<address>
<addressline1>3417 south plaza dr</addressline1>
<city>Costa mesa</city>
<state>CA</state>
</address>
</otherInfo>
<stores>
<store>
<name>Costco</name>
</store>
</stores>
</book>
<book id="bk109">
<author>Kress, Peter</author>
<title>Paradox Lost</title>
<genre>Science Fiction</genre>
<price>6.95</price>
<publish_date>2000-11-02</publish_date>
<description>After an inadvertant trip through a Heisenberg
Uncertainty Device, James Salway discovers the problems
of being quantum.</description>
<otherInfo>
<pagesCount>100</pagesCount>
<language>english</language>
<country>India</country>
<address>
<addressline1>3417 south plaza dr</addressline1>
<city>Costa mesa</city>
<state>CA</state>
</address>
</otherInfo>
<stores>
<store>
<name>Costco</name>
</store>
</stores>
</book>
<book id="bk110">
<author>O'Brien, Tim</author>
<title>Microsoft .NET: The Programming Bible</title>
<genre>Computer</genre>
<price>36.95</price>
<publish_date>2000-12-09</publish_date>
<description>Microsoft's .NET initiative is explored in
detail in this deep programmer's reference.</description>
<otherInfo>
<pagesCount>100</pagesCount>
<language>english</language>
<country>India</country>
<address>
<addressline1>3417 south plaza dr</addressline1>
<city>Costa mesa</city>
<state>CA</state>
</address>
</otherInfo>
<stores>
<store>
<name>Costco</name>
</store>
</stores>
</book>
<book id="bk111">
<author>O'Brien, Tim</author>
<title>MSXML3: A Comprehensive Guide</title>
<genre>Computer</genre>
<price>36.95</price>
<publish_date>2000-12-01</publish_date>
<description>The Microsoft MSXML3 parser is covered in
detail, with attention to XML DOM interfaces, XSLT processing,
SAX and more.</description>
<otherInfo>
<pagesCount>100</pagesCount>
<language>english</language>
<country>India</country>
<address>
<addressline1>3417 south plaza dr</addressline1>
<city>Costa mesa</city>
<state>CA</state>
</address>
</otherInfo>
<stores>
<store>
<name>Costco</name>
</store>
</stores>
</book>
<book id="bk112">
<author>Galos, Mike</author>
<title>Visual Studio 7: A Comprehensive Guide</title>
<genre>Computer</genre>
<price>49.95</price>
<publish_date>2001-04-16</publish_date>
<description>Microsoft Visual Studio 7 is explored in depth,
looking at how Visual Basic, Visual C++, C#, and ASP+ are
integrated into a comprehensive development
environment.</description>
<otherInfo>
<pagesCount>100</pagesCount>
<language>english</language>
<country>India</country>
<address>
<addressline1>3417 south plaza dr</addressline1>
<city>Costa mesa</city>
<state>CA</state>
</address>
</otherInfo>
<stores>
<store>
<name>Costco</name>
</store>
</stores>
</book>
</catalog>