Apache Spark SQL issue in a multi-node Hadoop cluster

Posted: 2015-08-04 08:50:12

Problem description:

Hi, I am using the Spark Java API to fetch data from Hive. This code works on a single-node Hadoop cluster, but when I try it on a multi-node Hadoop cluster it throws the following error:

org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.

Note: on the single node I use local as the master; on the multi-node cluster I use yarn-cluster.

Here is my Java code:

 import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.JavaSparkContext;
 import org.apache.spark.sql.hive.HiveContext;

 SparkConf sparkConf = new SparkConf().setAppName("Hive").setMaster("yarn-cluster");
 JavaSparkContext ctx = new JavaSparkContext(sparkConf);
 HiveContext sqlContext = new HiveContext(ctx.sc());
 org.apache.spark.sql.Row[] result = sqlContext.sql("Select * from Tablename").collect();

I also tried changing the master to local, and now it throws an unknown host exception. Can anyone help me?

Update

Error log:

15/08/05 11:30:25 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/08/05 11:30:25 INFO ObjectStore: Initialized ObjectStore
15/08/05 11:30:25 INFO HiveMetaStore: Added admin role in metastore
15/08/05 11:30:25 INFO HiveMetaStore: Added public role in metastore
15/08/05 11:30:25 INFO HiveMetaStore: No user is added in admin role, since config is empty
15/08/05 11:30:25 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/08/05 11:30:25 INFO HiveMetaStore: 0: get_table : db=default tbl=activity
15/08/05 11:30:25 INFO audit: ugi=labuser   ip=unknown-ip-addr  cmd=get_table : db=default tbl=activity 
15/08/05 11:30:25 WARN HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect.  Use hive.hmshandler.retry.* instead
15/08/05 11:30:25 INFO deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/08/05 11:30:26 INFO MemoryStore: ensureFreeSpace(399000) called with curMem=0, maxMem=1030823608
15/08/05 11:30:26 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 389.6 KB, free 982.7 MB)
15/08/05 11:30:26 INFO MemoryStore: ensureFreeSpace(34309) called with curMem=399000, maxMem=1030823608
15/08/05 11:30:26 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 33.5 KB, free 982.7 MB)
15/08/05 11:30:26 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.16.100.7:61775 (size: 33.5 KB, free: 983.0 MB)
15/08/05 11:30:26 INFO SparkContext: Created broadcast 0 from collect at Hive.java:29
Exception in thread "main" java.lang.IllegalArgumentException: java.net.UnknownHostException: hadoopcluster
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:373)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:258)
    at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:153)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:602)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:547)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:139)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2625)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2607)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1783)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:885)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:884)
    at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:105)
    at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1255)
    at com.Hive.main(Hive.java:29)
Caused by: java.net.UnknownHostException: hadoopcluster
    ... 44 more

Answer 1:

As the exception indicates, yarn-cluster mode cannot be used directly from a SparkContext. You can, however, run this on a standalone multi-node cluster with a SparkContext: first start a standalone Spark cluster, then set sparkConf.setMaster("spark://HOST:PORT"), where HOST:PORT is the URL of your Spark master. I hope this solves your problem.
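
For illustration, a minimal sketch of the configuration the answer describes. The URL spark://spark-master-host:7077 is a placeholder for your own standalone master (7077 is the default standalone port):

 import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.JavaSparkContext;
 import org.apache.spark.sql.hive.HiveContext;

 // Same code as in the question, but pointed at a standalone Spark master
 // instead of yarn-cluster, so it can be launched directly from the driver JVM.
 SparkConf sparkConf = new SparkConf()
         .setAppName("Hive")
         .setMaster("spark://spark-master-host:7077");
 JavaSparkContext ctx = new JavaSparkContext(sparkConf);
 HiveContext sqlContext = new HiveContext(ctx.sc());
 org.apache.spark.sql.Row[] result = sqlContext.sql("Select * from Tablename").collect();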

Discussion:

... I tried what you suggested: I created a standalone Spark multi-node cluster and ran this query with the Spark master URL set to spark://HOST:PORT. But it throws an UnknownHostException: hadoopcluster, where hadoopcluster is the name of the multi-node Hadoop cluster. Please find my error log in the edited part of the question.

I suspect it cannot resolve the host name "hadoopcluster". Try using the IP of the machine where the master is running.

... Do I need to add the IP of the Hadoop host in files such as core-site.xml?

I don't think so. You can run a DNS service where you enter the mappings (IP to host name), or insert the mapping into the hosts file. But then you have to insert it into the hosts file on every machine that has to do the name resolution.

I have solved this exception by copying hdfs-site.xml into the spark/conf folder.
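
For reference, a hedged sketch of a programmatic equivalent of the fix in the last comment: instead of copying the Hadoop client configuration into spark/conf, the driver can load it itself, so that the logical name "hadoopcluster" (most likely an HDFS nameservice defined in hdfs-site.xml, not a real host) becomes resolvable. The /etc/hadoop/conf paths and the master URL below are assumptions; adjust them to your installation.

 import org.apache.hadoop.fs.Path;
 import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.JavaSparkContext;

 SparkConf sparkConf = new SparkConf().setAppName("Hive").setMaster("spark://spark-master-host:7077");
 JavaSparkContext ctx = new JavaSparkContext(sparkConf);
 // Make the cluster's HDFS client configuration visible to this JVM so the
 // nameservice "hadoopcluster" can be resolved to the actual NameNode(s).
 ctx.hadoopConfiguration().addResource(new Path("/etc/hadoop/conf/core-site.xml"));
 ctx.hadoopConfiguration().addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));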
