为啥使用 DataFrame 时 Spark 会报告“java.net.URISyntaxException:绝对 URI 中的相对路径”?

Posted

技术标签:

【中文标题】为啥使用 DataFrame 时 Spark 会报告“java.net.URISyntaxException:绝对 URI 中的相对路径”?【英文标题】:Why does Spark report "java.net.URISyntaxException: Relative path in absolute URI" when working with DataFrames?为什么使用 DataFrame 时 Spark 会报告“java.net.URISyntaxException:绝对 URI 中的相对路径”? 【发布时间】:2016-08-14 08:02:14 【问题描述】:

我在 Windows 机器上本地运行 Spark。我能够成功启动 spark shell 并将文本文件作为 RDD 读取。我还能够遵循有关此主题的各种在线教程,并能够对 RDD 执行各种操作。

但是,当我尝试将 RDD 转换为 DataFrame 时,出现错误。这就是我正在做的事情:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

//convert rdd to df
val df = rddFile.toDF()

此代码生成一长串似乎与以下错误消息相关的错误消息:

Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:/Users/spark/spark-warehouse
        at org.apache.hadoop.fs.Path.initialize(Path.java:205)
        at org.apache.hadoop.fs.Path.<init>(Path.java:171)
        at org.apache.hadoop.hive.metastore.Warehouse.getWhRoot(Warehouse.java:159)
        at org.apache.hadoop.hive.metastore.Warehouse.getDefaultDatabasePath(Warehouse.java:177)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB_core(HiveMetaStore.java:600)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461)
        at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)
        at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
        at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:199)
        at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
        ... 85 more
Caused by: java.net.URISyntaxException: Relative path in absolute URI: file:C:/Users/spark/spark-warehouse
        at java.net.URI.checkPath(URI.java:1823)
        at java.net.URI.<init>(URI.java:745)
        at org.apache.hadoop.fs.Path.initialize(Path.java:202)
        ... 96 more

整个堆栈跟踪如下。

16/08/16 12:36:20 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/08/16 12:36:20 WARN Hive: Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
        at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
        at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
        at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)
        at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
        at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:171)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
        at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
        at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
        at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
        at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
        at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
        at org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
        at org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
        at org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
        at org.apache.spark.sql.hive.HiveSessionState$$anon$1.<init>(HiveSessionState.scala:63)
        at org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
        at org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
        at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
        at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
        at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:382)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:143)
        at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:401)
        at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:342)
        at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:24)
        at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:29)
        at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:31)
        at $line14.$read$$iw$$iw$$iw$$iw$$iw.<init>(<console>:33)
        at $line14.$read$$iw$$iw$$iw$$iw.<init>(<console>:35)
        at $line14.$read$$iw$$iw$$iw.<init>(<console>:37)
        at $line14.$read$$iw$$iw.<init>(<console>:39)
        at $line14.$read$$iw.<init>(<console>:41)
        at $line14.$read.<init>(<console>:43)
        at $line14.$read$.<init>(<console>:47)
        at $line14.$read$.<clinit>(<console>)
        at $line14.$eval$.$print$lzycompute(<console>:7)
        at $line14.$eval$.$print(<console>:6)
        at $line14.$eval.$print(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
        at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
        at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
        at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
        at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
        at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
        at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
        at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
        at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
        at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
        at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
        at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
        at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:415)
        at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:923)
        at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
        at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
        at scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
        at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)
        at org.apache.spark.repl.Main$.doMain(Main.scala:68)
        at org.apache.spark.repl.Main$.main(Main.scala:51)
        at org.apache.spark.repl.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
        at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
        at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
        at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
        at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
        ... 74 more
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
        ... 80 more
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:/Users/spark/spark-warehouse
        at org.apache.hadoop.fs.Path.initialize(Path.java:205)
        at org.apache.hadoop.fs.Path.<init>(Path.java:171)
        at org.apache.hadoop.hive.metastore.Warehouse.getWhRoot(Warehouse.java:159)
        at org.apache.hadoop.hive.metastore.Warehouse.getDefaultDatabasePath(Warehouse.java:177)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB_core(HiveMetaStore.java:600)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461)
        at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)
        at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
        at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:199)
        at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
        ... 85 more
Caused by: java.net.URISyntaxException: Relative path in absolute URI: file:C:/Users/spark/spark-warehouse
        at java.net.URI.checkPath(URI.java:1823)
        at java.net.URI.<init>(URI.java:745)
        at org.apache.hadoop.fs.Path.initialize(Path.java:202)
        ... 96 more

【问题讨论】:

【参考方案1】:

SPARK-15565 issue in Spark 2.0 on Windows 有一个简单的解决方案(似乎是 part of Spark's codebase,可能很快会作为 2.0.2 或 2.1.0 发布)。

Spark 2.0.0 中的解决方案是将spark.sql.warehouse.dir 设置为某个正确引用的目录,例如使用///(三斜杠)的file:///c:/Spark/spark-2.0.0-bin-hadoop2.7/spark-warehouse

--conf 参数开始spark-shell,如下所示:

spark-shell --conf spark.sql.warehouse.dir=file:///c:/tmp/spark-warehouse

或者使用新的 fluent builder 模式在您的 Spark 应用程序中创建一个SparkSession,如下所示:

import org.apache.spark.sql.SparkSession
SparkSession spark = SparkSession
  .builder()
  .config("spark.sql.warehouse.dir", "file:///c:/tmp/spark-warehouse")
  .getOrCreate()

或者创建conf/spark-defaults.conf,内容如下:

spark.sql.warehouse.dir file:///c:/tmp/spark-warehouse

【讨论】:

这个问题在 Spark 2.2.0 上仍然出现在我看来,但我使用的是谷歌云存储,所以我有这样的路径 gs://prefix:remaining 我应该怎么做才能解决这个问题。我无法删除中间冒号。我无法控制?【参考方案2】:

如果您确实想在代码中修复它而不触及现有代码,也可以从系统属性中传递它,这样之后的火花初始化就不会改变。

System.setProperty(
    "spark.sql.warehouse.dir", 
    s"file:///$System.getProperty("user.dir")/spark-warehouse"
    .replaceAll("\\\\", "/")
)

注意,这也是使用当前工作目录,可以替换为“c:/tmp/”,或者任何你想要的 spark-warehouse 目录。

【讨论】:

【参考方案3】:

我也遇到了这个错误。就我而言,我正在编写一个名称为 ex:/tmp/dattaset_10:23:11 的带有字符“:”的文件。然后 spark 会感到困惑,因为认为 : 是路径的一部分(如 C:\)。

解决方法就是去掉文件名中的:字符

【讨论】:

【参考方案4】:

只需设置 config("spark.sql.warehouse.dir", "file:///tmp/spark-warehouse")。

无需包含 "C:/"

Spark 驱动程序会自动在 /tmp/spark-warehouse 目录下创建文件夹。如果是 windows,它将在 "C:/" 下。 "C:" 驱动器如果主不在 Windows 上的本地驱动器将无法工作。

【讨论】:

【参考方案5】:

我设置 spark.sql.warehouse.dir 属性来修复现有代码中的错误

System.setProperty("spark.sql.warehouse.dir", "file:///C:/spark-warehouse");

这里是代码sn-p

System.setProperty("hadoop.home.dir", "c:/winutil/");
System.setProperty("spark.sql.warehouse.dir", "file:///C:/spark-warehouse");
val conf = new SparkConf().setAppName("test").setMaster("local[*]")
val sc = new SparkContext(conf)
val lines = sc.textFile("C:/user.txt")

【讨论】:

以上是关于为啥使用 DataFrame 时 Spark 会报告“java.net.URISyntaxException:绝对 URI 中的相对路径”?的主要内容,如果未能解决你的问题,请参考以下文章

为啥 Spark DataFrame 会创建错误数量的分区?

为啥 DataFrame 在 spark 2.2 中仍然存在,甚至 DataSet 在 scala 中也提供了更多的性能? [复制]

解决spark dataframe get 报空指针异常 java.lang.NullPointerException

为啥即使使用限制命令访问结果,SPARK \PYSPARK 也会计算所有内容?

使用Scala在Spark中创建DataFrame时出错

在 Spark 中使用 Dataframe 获取平均值