Spark Standalone: how to pass a local .jar file to the cluster

Posted 2023-04-18


[Title] Spark Standalone how to pass local .jar file to cluster [Posted] 2020-03-13 10:37:52 [Question]:

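I have a cluster with two workers and one master. To start the master and the workers, I run sbin/start-master.sh and sbin/start-slaves.sh on the master machine. The master UI then shows the workers as ALIVE (so everything is fine up to this point). The problem appears when I try to use spark-submit.

I run this command on my local machine: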

spark-submit --master spark://<master-ip>:7077 --deploy-mode cluster /home/user/example.jar

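But the following error appears: ERROR ClientEndpoint: Exception from cluster was: java.nio.file.NoSuchFileException: /home/user/example.jar

I have been doing some research on Stack Overflow and in Spark's documentation, and it seems that I should specify the application-jar of the spark-submit command as a "Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes." (as stated at https://spark.apache.org/docs/latest/submitting-applications.html).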
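One way to satisfy the "globally visible" requirement quoted above is to stage the jar on storage that every node can read and then submit it by URL. The following is only a sketch under assumptions that are not in the question: it presumes an HDFS instance is reachable from the local machine and from all workers, and the host names (master-host, namenode-host), the /apps directory, and the class name com.example.Main are hypothetical placeholders. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: stage a local jar on HDFS, then submit it by its hdfs:// URL.
# All host names, paths and the class name in the example call are hypothetical.
# DRY_RUN=1 (the default) prints the commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

stage_and_submit() {
  local_jar="$1"; master_url="$2"; main_class="$3"; remote_dir="$4"
  remote_jar="$remote_dir/$(basename "$local_jar")"
  # Create the target directory and upload (overwrite) the jar.
  run hdfs dfs -mkdir -p "$remote_dir"
  run hdfs dfs -put -f "$local_jar" "$remote_jar"
  # The application jar is now a URL that any worker can resolve.
  run spark-submit --master "$master_url" --deploy-mode cluster \
    --class "$main_class" "$remote_jar"
}

stage_and_submit /home/user/example.jar spark://master-host:7077 \
  com.example.Main hdfs://namenode-host:8020/apps
```

With this layout, a rebuilt jar only has to be uploaded once per build rather than copied to each node.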

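My question is: how can I make my .jar globally visible inside the cluster? There is a similar question here, "Spark Standalone cluster cannot read the files in local filesystem", but the solutions did not work for me.

Also, am I doing something wrong by starting the cluster on my master machine with sbin/start-master.sh and then running spark-submit on my local machine? I started the master from a terminal on the master machine because that is what I read in Spark's documentation, but maybe this is related to the problem. From Spark's documentation: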

Once you’ve set up this file, you can launch or stop your cluster with the following shell scripts, based on Hadoop’s deploy scripts, and available in SPARK_HOME/sbin: [...] Note that these scripts must be executed on the machine you want to run the Spark master on, not your local machine.

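Thanks a lot.

Edit: I have copied the .jar file to each worker and it works. But my point is to find out whether there is a better way, since this approach forces me to copy the .jar to every worker each time I build a new jar. (This is one of the answers to the linked question "Spark Standalone cluster cannot read the files in local filesystem".)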
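If copying to every worker remains the chosen approach, the copy step can at least be scripted. This is a minimal sketch, not a recommendation: the worker host names (worker1, worker2) are hypothetical, passwordless SSH to each worker is assumed, and the jar's directory is assumed to exist on every worker. By default it only prints the scp commands.

```shell
#!/bin/sh
# Sketch: push a freshly built jar to the same absolute path on every worker,
# so that a file:// or plain path resolves on whichever node runs the driver.
# Host names are hypothetical; passwordless SSH is assumed.
# DRY_RUN=1 (the default) prints the commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"

copy_jar_to_workers() {
  jar="$1"; shift
  for host in "$@"; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "scp $jar $host:$jar"
    else
      scp "$jar" "$host:$jar" || return 1
    fi
  done
}

copy_jar_to_workers /home/user/example.jar worker1 worker2
```

Setting DRY_RUN=0 in the environment would perform the copies for real.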

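[Comments]:

Did you try using --jars example.jar when running spark-submit to indicate where the jar file can be found?

Hi Oli, thanks for your answer! How would you do that? If I add --jars example.jar after the whole command I wrote above, it still gives me the same error (NoSuchFileException). And if I leave out the path above and write --jars example.jar or --jars /home/user/example.jar instead, it gives me the error: Missing application resource.

Please try providing the --class option, like this: spark-submit --master spark://:7077 --deploy-mode cluster --jars /home/user/example.jar --class

Hi Sarath! Thanks for your answer. I tried it, and spark-submit gives me the error Missing application resource. (and prints the options available for spark-submit).

[Answer 1]: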

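@meisan Your spark-submit command is missing two things.

First, your dependency jars should be added with the --jars flag. Second, you need to name the file that holds your driver code, i.e. the main function.

You have not said anywhere whether you are using Scala or Python, but in a nutshell your command will look something like this:

For Python: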

spark-submit --master spark://<master>:7077 --deploy-mode cluster --jars <dependency-jars> <python-file-holding-driver-logic>

For Scala:

spark-submit --master spark://<master>:7077 --deploy-mode cluster --class <scala-driver-class> --driver-class-path <application-jar> --jars <dependency-jars>

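Also, when you use the documented flags, Spark takes care of sending the required files and jars to the executors. If you want to omit the --driver-class-path flag, you can set the environment variable SPARK_CLASSPATH to the path where all your jars are placed.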
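For reference, spark-submit expects the application jar as a positional argument after the options, and its documented flag for extra dependencies is --jars with a comma-separated list. Leaving the application jar out is what produces the "Missing application resource" error mentioned in the comments. The sketch below only prints a command line in that shape; the host name, class name, and jar paths in the example call are hypothetical. It uses client deploy mode in the example call, where the driver runs in the spark-submit process on the submitting machine, so a local jar path is resolved locally; this assumes the workers can reach that machine over the network.

```shell
#!/bin/sh
# Sketch: print a spark-submit line for a JVM application.
# Arguments: master URL, deploy mode, main class, application jar,
# then zero or more dependency jars. Values in the example call are hypothetical.
build_submit() {
  master="$1"; mode="$2"; main_class="$3"; app_jar="$4"
  shift 4
  # Join the remaining arguments with commas, as --jars expects.
  deps=""
  for jar in "$@"; do
    deps="${deps:+$deps,}$jar"
  done
  cmd="spark-submit --master $master --deploy-mode $mode --class $main_class"
  if [ -n "$deps" ]; then
    cmd="$cmd --jars $deps"
  fi
  # The application jar goes last, after all options.
  echo "$cmd $app_jar"
}

build_submit spark://master-host:7077 client com.example.Main \
  /home/user/example.jar /home/user/libs/dep1.jar /home/user/libs/dep2.jar
```

Passing cluster as the second argument together with a jar URL that all nodes can read gives the cluster-mode form of the same command.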
