Spark External Data Sources: Parquet
Posted by arthurlance
Working with Parquet data
Loading a non-Parquet file without specifying a format fails, because Parquet is Spark SQL's default data source:

```
RuntimeException: file:/Users/arthurlance/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.json is not a Parquet file
```
That default is defined in Spark's own source (`SQLConf`):

```scala
val DEFAULT_DATA_SOURCE_NAME = SQLConfigBuilder("spark.sql.sources.default")
  .doc("The default data source to use in input/output.")
  .stringConf
  .createWithDefault("parquet")
```
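Since `spark.sql.sources.default` is an ordinary config key, the default can be overridden at session build time. The sketch below assumes a local JSON file at a hypothetical path; with the default switched to `json`, a bare `spark.read.load()` then expects JSON rather than Parquet:

```scala
import org.apache.spark.sql.SparkSession

object DefaultSourceApp {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder()
      .appName("DefaultSourceApp")
      .master("local[2]")
      // override the built-in default ("parquet") for this session
      .config("spark.sql.sources.default", "json")
      .getOrCreate()

    // no format(...) call needed: load() now resolves to the JSON source
    // (the path below is a placeholder, not from the original article)
    val df = spark.read.load("file:///path/to/people.json")
    df.show()

    spark.stop()
  }
}
```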
Note the `USING` clause, which names the data source implementation backing the view:

```sql
CREATE TEMPORARY VIEW parquetTable
USING org.apache.spark.sql.parquet
OPTIONS (
  path "/Users/arthurlance/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/users.parquet"
);

SELECT * FROM parquetTable;
```
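The same temporary view can be created programmatically by passing the DDL through `spark.sql`. A minimal sketch, assuming the same `users.parquet` example file:

```scala
import org.apache.spark.sql.SparkSession

object ParquetViewApp {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder()
      .appName("ParquetViewApp")
      .master("local[2]")
      .getOrCreate()

    // register the Parquet file as a temporary view via the USING syntax
    spark.sql(
      """CREATE TEMPORARY VIEW parquetTable
        |USING org.apache.spark.sql.parquet
        |OPTIONS (
        |  path "/Users/arthurlance/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/users.parquet"
        |)""".stripMargin)

    // query it like any other table
    spark.sql("SELECT * FROM parquetTable").show()

    spark.stop()
  }
}
```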
Without an alias, the aggregate column is literally named `count(1)`, which contains characters that are invalid for a saved table's schema:

```
org.apache.spark.sql.AnalysisException: Attribute name "count(1)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
```

Adding an alias fixes it:

```scala
spark.sql("select deptno, count(1) as mount from emp group by deptno")
  .filter("deptno is not null")
  .write.saveAsTable("hive_table_1")
```
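The rename can also be done on the DataFrame side with `withColumnRenamed` instead of a SQL alias. A sketch under the assumption that a table or view named `emp` with a `deptno` column is already registered, and that Hive support is enabled for `saveAsTable`:

```scala
import org.apache.spark.sql.SparkSession

object AliasFixApp {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder()
      .appName("AliasFixApp")
      .master("local[2]")
      .enableHiveSupport() // assumption: saving into a Hive-backed table
      .getOrCreate()

    // groupBy(...).count() yields a column named "count";
    // rename it explicitly, mirroring the SQL alias above
    val agg = spark.table("emp")
      .groupBy("deptno")
      .count()
      .withColumnRenamed("count", "mount")
      .filter("deptno is not null")

    agg.write.saveAsTable("hive_table_1")

    spark.stop()
  }
}
```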
```scala
spark.sqlContext.setConf("spark.sql.shuffle.partitions", "10")
```

In production, always tune `spark.sql.shuffle.partitions` to match your data volume; the default is 200.
```scala
package com.imooc.spark

import org.apache.spark.sql.SparkSession

/**
 * Parquet file operations
 */
object ParquetApp {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder().appName("SparkSessionApp")
      .master("local[2]").getOrCreate()

    /**
     * spark.read.format("parquet").load is the standard way to read
     */
    val userDF = spark.read.format("parquet").load("file:///Users/arthurlance/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/users.parquet")

    userDF.printSchema()
    userDF.show()

    userDF.select("name", "favorite_color").show

    userDF.select("name", "favorite_color").write.format("json").save("file:///Users/arthurlance/tmp/jsonout")

    spark.read.load("file:///Users/arthurlance/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/users.parquet").show

    // this fails: Spark SQL's default format is parquet, but the file is JSON
    spark.read.load("file:///Users/arthurlance/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.json").show

    spark.read.format("parquet").option("path", "file:///Users/arthurlance/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/users.parquet").load().show

    spark.stop()
  }
}
```