PySpark Unable to read csv from hdfs: HiveExternalCatalog error

Posted: 2019-08-13 10:18:20

I'm new to Spark and have been trying to debug an error. I'm trying to read multiple files from hdfs. I'm using sparksession.read.csv for this, but I get the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o64.csv. : java.lang.NoClassDefFoundError: org/apache/spark/sql/hive/HiveExternalCatalog

I read on Cloudera's community forums that all executors must have access to the Hive jars. I tried adding them via the --jars option, but to no avail.

The jars do show up in the driver web UI on port 4040.

Here is my code:

from pyspark import SparkConf
from pyspark.sql import SparkSession

APP_NAME = 'Test'
file_path = 'hdfs:///csv_files/test.csv'

if __name__ == '__main__':
    conf = SparkConf().setAppName(APP_NAME)
    spark = SparkSession.builder.config(conf=conf).appName(APP_NAME).getOrCreate()
    spark_df = spark.read.csv(file_path)
    spark_df.printSchema()
    spark.stop()

And submit it to Spark with:

    sudo -u spark PYSPARK_PYTHON=./parallelPython/env/bin/python spark-submit --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./parallelPython/env/bin/python --master yarn --jars $HIVE_CLASSPATH --archives env.zip#parallelPython parallelTestHive.py
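For context, the stack trace that follows shows the failure originates in Cloudera's lineage listener (com.cloudera.spark.lineage.NavigatorQueryListener), which needs HiveExternalCatalog on the classpath, rather than in the CSV read itself. A workaround sometimes suggested for CDH clusters is to disable lineage collection for the job; the config key below is Cloudera-specific and should be verified against your CDH version:

```shell
# Sketch of a possible workaround (assumes the Cloudera-specific key
# spark.lineage.enabled is honored by your CDH release): turn off lineage
# collection so the Navigator listener is never invoked.
sudo -u spark spark-submit \
  --master yarn \
  --conf spark.lineage.enabled=false \
  parallelTestHive.py
```

This is a config fragment for a YARN cluster, so it is not runnable standalone; treat it as a starting point, not a confirmed fix.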

The error:

Traceback (most recent call last):
  File "/home/ubuntu/parallelPython/parallelPython/parallelTestHive.py", line 63, in <module>
    spark_df = spark.read.csv('hdfs:///csv_files/1.csv')
  File "/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 472, in csv
  File "/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o64.csv.
: java.lang.NoClassDefFoundError: org/apache/spark/sql/hive/HiveExternalCatalog
    at org.apache.spark.sql.query.analysis.QueryAnalysis$.hiveCatalog(QueryAnalysis.scala:69)
    at org.apache.spark.sql.query.analysis.QueryAnalysis$.getLineageInfo(QueryAnalysis.scala:88)
    at com.cloudera.spark.lineage.NavigatorQueryListener.onSuccess(ClouderaNavigatorListener.scala:60)
    at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1$$anonfun$apply$mcV$sp$1.apply(QueryExecutionListener.scala:124)
    at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1$$anonfun$apply$mcV$sp$1.apply(QueryExecutionListener.scala:123)
    at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$org$apache$spark$sql$util$ExecutionListenerManager$$withErrorHandling$1.apply(QueryExecutionListener.scala:145)
    at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$org$apache$spark$sql$util$ExecutionListenerManager$$withErrorHandling$1.apply(QueryExecutionListener.scala:143)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
    at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
    at org.apache.spark.sql.util.ExecutionListenerManager.org$apache$spark$sql$util$ExecutionListenerManager$$withErrorHandling(QueryExecutionListener.scala:143)
    at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1.apply$mcV$sp(QueryExecutionListener.scala:123)
    at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1.apply(QueryExecutionListener.scala:123)
    at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1.apply(QueryExecutionListener.scala:123)
    at org.apache.spark.sql.util.ExecutionListenerManager.readLock(QueryExecutionListener.scala:156)
    at org.apache.spark.sql.util.ExecutionListenerManager.onSuccess(QueryExecutionListener.scala:122)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3367)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2544)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2758)
    at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:232)
    at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:68)
    at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:63)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:179)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:179)
    at scala.Option.orElse(Option.scala:289)
    at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:178)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:372)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:615)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.hive.HiveExternalCatalog
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 41 more

Comments:

Answer 1:

This error occurs when libraries are configured incorrectly. You can try including the CSV package, which will download the dependency from spark-packages.org and add it to the current session's classpath:

 $SPARK_HOME/bin/spark-submit --packages com.databricks:spark-csv_2.11:1.5.0 ... ... 
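A fuller sketch of that submit command, using the package coordinates from the answer (the script name is a placeholder; the `_2.11` suffix assumes a Spark build on Scala 2.11):

```shell
# Sketch: pull the Databricks spark-csv package from spark-packages.org at
# submit time and add it to the driver and executor classpaths.
$SPARK_HOME/bin/spark-submit \
  --packages com.databricks:spark-csv_2.11:1.5.0 \
  --master yarn \
  your_script.py
```

Note that the spark-csv package targets Spark 1.x; on Spark 2.x (including the CDH 6.2 build in the traceback) the CSV data source is built in, so the package may not address the HiveExternalCatalog classpath issue itself.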

Comments:
