Execute Spark job on Hortonworks Sandbox from outside
Asked: 2018-04-04 11:37:55

I am running the Hortonworks Sandbox as a virtual machine in VirtualBox.
From an IDE (IntelliJ IDEA) on my local machine, I am trying to execute a Spark job on the sandbox VM, so far without success.
This is the Spark job code:
import org.apache.spark.{SparkConf, SparkContext}

object HelloWorld {
  def main(args: Array[String]): Unit = {
    val logFile = "file:///tmp/words.txt" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application").setMaster("spark://127.0.0.1:4040")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
This is the error log I get from the run:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/04/04 13:16:50 INFO SparkContext: Running Spark version 2.2.0
18/04/04 13:16:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/04/04 13:16:50 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:378)
...
18/04/04 13:16:50 INFO SparkContext: Submitted application: Simple Application
18/04/04 13:16:50 INFO SecurityManager: Changing view acls to: jaramos
18/04/04 13:16:50 INFO SecurityManager: Changing modify acls to: jaramos
18/04/04 13:16:50 INFO SecurityManager: Changing view acls groups to:
18/04/04 13:16:50 INFO SecurityManager: Changing modify acls groups to:
18/04/04 13:16:50 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jaramos); groups with view permissions: Set(); users with modify permissions: Set(jaramos); groups with modify permissions: Set()
18/04/04 13:16:51 INFO Utils: Successfully started service 'sparkDriver' on port 54849.
18/04/04 13:16:51 INFO SparkEnv: Registering MapOutputTracker
18/04/04 13:16:51 INFO SparkEnv: Registering BlockManagerMaster
18/04/04 13:16:51 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
18/04/04 13:16:51 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
18/04/04 13:16:51 INFO DiskBlockManager: Created local directory at C:\Users\jaramos\AppData\Local\Temp\blockmgr-93e05db6-a65a-4a3f-b238-9cde5d918bc2
18/04/04 13:16:51 INFO MemoryStore: MemoryStore started with capacity 1986.6 MB
18/04/04 13:16:51 INFO SparkEnv: Registering OutputCommitCoordinator
18/04/04 13:16:51 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
18/04/04 13:16:51 INFO Utils: Successfully started service 'SparkUI' on port 4041.
18/04/04 13:16:51 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.0.75.1:4041
18/04/04 13:16:52 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://127.0.0.1:4040...
18/04/04 13:16:52 INFO TransportClientFactory: Successfully created connection to /127.0.0.1:4040 after 25 ms (0 ms spent in bootstraps)
18/04/04 13:16:52 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /127.0.0.1:4040 is closed
18/04/04 13:16:52 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master 127.0.0.1:4040
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
...
Caused by: java.io.IOException: Connection from /127.0.0.1:4040 closed
at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146)
...
[the same connect / connection-closed / failed-to-connect cycle repeats twice more, at 13:17:12 and 13:17:32]
18/04/04 13:17:52 ERROR StandaloneSchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
18/04/04 13:17:52 WARN StandaloneSchedulerBackend: Application ID is not initialized yet.
18/04/04 13:17:52 INFO SparkUI: Stopped Spark web UI at http://10.0.75.1:4041
18/04/04 13:17:52 INFO StandaloneSchedulerBackend: Shutting down all executors
18/04/04 13:17:52 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
18/04/04 13:17:52 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 54923.
18/04/04 13:17:52 INFO NettyBlockTransferService: Server created on 10.0.75.1:54923
18/04/04 13:17:52 WARN StandaloneAppClient$ClientEndpoint: Drop UnregisterApplication(null) because has not yet connected to master
18/04/04 13:17:52 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
18/04/04 13:17:52 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.0.75.1, 54923, None)
18/04/04 13:17:52 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/04/04 13:17:52 INFO BlockManagerMasterEndpoint: Registering block manager 10.0.75.1:54923 with 1986.6 MB RAM, BlockManagerId(driver, 10.0.75.1, 54923, None)
18/04/04 13:17:52 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.0.75.1, 54923, None)
18/04/04 13:17:52 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.0.75.1, 54923, None)
18/04/04 13:17:52 INFO MemoryStore: MemoryStore cleared
18/04/04 13:17:52 INFO BlockManager: BlockManager stopped
18/04/04 13:17:52 INFO BlockManagerMaster: BlockManagerMaster stopped
18/04/04 13:17:52 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/04/04 13:17:52 INFO SparkContext: Successfully stopped SparkContext
18/04/04 13:17:52 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.metrics.MetricsSystem.getServletHandlers(MetricsSystem.scala:91)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:524)
at HelloWorld$.main(HelloWorld.scala:8)
at HelloWorld.main(HelloWorld.scala)
18/04/04 13:17:52 INFO SparkContext: SparkContext already stopped.
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.metrics.MetricsSystem.getServletHandlers(MetricsSystem.scala:91)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:524)
at HelloWorld$.main(HelloWorld.scala:8)
at HelloWorld.main(HelloWorld.scala)
18/04/04 13:17:52 INFO ShutdownHookManager: Shutdown hook called
18/04/04 13:17:52 INFO ShutdownHookManager: Deleting directory C:\Users\jaramos\AppData\Local\Temp\spark-0e2461c0-f3fa-402b-8fa9-d4e3ede388d1
How can I connect to the remote Spark machine?

Thanks in advance!
Answer 1:

Use port mapping to expose all the relevant Hadoop and environment component ports, for example 9083 for the Hive metastore. Then copy your hive-site.xml and hdfs-site.xml into your IntelliJ resources directory. It should work then.
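A minimal sketch of the port-mapping step, assuming the sandbox sits behind VirtualBox NAT; the VM name, SSH port 2222, and the exact port list are assumptions taken from a typical Hortonworks Sandbox setup, not from the question — adjust them to your own configuration:

```shell
# Forward the ports the Spark driver needs from the host to the sandbox VM.
# 8020 = HDFS NameNode RPC, 9083 = Hive metastore, 8032 = YARN ResourceManager.
VBoxManage modifyvm "Hortonworks Sandbox" --natpf1 "namenode,tcp,,8020,,8020"
VBoxManage modifyvm "Hortonworks Sandbox" --natpf1 "metastore,tcp,,9083,,9083"
VBoxManage modifyvm "Hortonworks Sandbox" --natpf1 "yarn-rm,tcp,,8032,,8032"

# Copy the cluster configs out of the sandbox (the sandbox usually maps SSH
# to host port 2222) into the IntelliJ project's resources directory, so they
# land on the driver's classpath.
scp -P 2222 root@localhost:/etc/hive/conf/hive-site.xml   src/main/resources/
scp -P 2222 root@localhost:/etc/hadoop/conf/hdfs-site.xml src/main/resources/
```

Note also that `spark://127.0.0.1:4040` in the question points at the Spark web UI port; a standalone master listens on 7077 by default, so whichever master URL you use, that port has to be reachable (and forwarded) as well.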
Comments:
Where is the resources directory located, and how do I load those files into it so that IntelliJ uses them?

/scala/main/resources
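For reference, in a conventional sbt/Maven layout the resources directory is `src/main/resources` relative to the project root; everything in it ends up on the classpath, where Hadoop's `Configuration` loader can find the XML files. A sketch, with the download location of the XMLs as a placeholder assumption:

```shell
# Create the conventional resources directory in the project root and
# place the cluster config files there; the build copies them to the classpath.
mkdir -p src/main/resources
cp /path/to/downloaded/hive-site.xml src/main/resources/   # hypothetical source path
cp /path/to/downloaded/hdfs-site.xml src/main/resources/   # hypothetical source path
```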