Install package Graphframes using spark-shell
Posted: 2021-06-11 12:23:27
Question: I am trying to install the PySpark package Graphframes using spark-shell:
pyspark --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12
However, the terminal shows the following error:
root@hpcc:~# pyspark --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/root/spark-3.0.2-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.2.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/root/spark-3.0.2-bin-hadoop3.2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-bb0fc7e9-5af7-4189-98e4-7ac76a8d97a9;1.0
confs: [default]
:: resolution report :: resolve 2691ms :: artifacts dl 1ms
:: modules in use:
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 1 | 0 | 0 | 0 || 0 | 0 |
---------------------------------------------------------------------
:: problems summary ::
:::: WARNINGS
module not found: graphframes#graphframes;0.8.1-spark3.0-s_2.12
==== local-m2-cache: tried
file:/root/.m2/repository/graphframes/graphframes/0.8.1-spark3.0-s_2.12/graphframes-0.8.1-spark3.0-s_2.12.pom
-- artifact graphframes#graphframes;0.8.1-spark3.0-s_2.12!graphframes.jar:
file:/root/.m2/repository/graphframes/graphframes/0.8.1-spark3.0-s_2.12/graphframes-0.8.1-spark3.0-s_2.12.jar
==== local-ivy-cache: tried
/root/.ivy2/local/graphframes/graphframes/0.8.1-spark3.0-s_2.12/ivys/ivy.xml
-- artifact graphframes#graphframes;0.8.1-spark3.0-s_2.12!graphframes.jar:
/root/.ivy2/local/graphframes/graphframes/0.8.1-spark3.0-s_2.12/jars/graphframes.jar
==== central: tried
https://repo1.maven.org/maven2/graphframes/graphframes/0.8.1-spark3.0-s_2.12/graphframes-0.8.1-spark3.0-s_2.12.pom
-- artifact graphframes#graphframes;0.8.1-spark3.0-s_2.12!graphframes.jar:
https://repo1.maven.org/maven2/graphframes/graphframes/0.8.1-spark3.0-s_2.12/graphframes-0.8.1-spark3.0-s_2.12.jar
==== spark-packages: tried
https://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.8.1-spark3.0-s_2.12/graphframes-0.8.1-spark3.0-s_2.12.pom
-- artifact graphframes#graphframes;0.8.1-spark3.0-s_2.12!graphframes.jar:
https://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.8.1-spark3.0-s_2.12/graphframes-0.8.1-spark3.0-s_2.12.jar
::::::::::::::::::::::::::::::::::::::::::::::
:: UNRESOLVED DEPENDENCIES ::
::::::::::::::::::::::::::::::::::::::::::::::
:: graphframes#graphframes;0.8.1-spark3.0-s_2.12: not found
::::::::::::::::::::::::::::::::::::::::::::::
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: graphframes#graphframes;0.8.1-spark3.0-s_2.12: not found]
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1389)
at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:308)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
File "/root/spark-3.0.2-bin-hadoop3.2/python/pyspark/shell.py", line 38, in <module>
SparkContext._ensure_initialized()
File "/root/spark-3.0.2-bin-hadoop3.2/python/pyspark/context.py", line 327, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "/root/spark-3.0.2-bin-hadoop3.2/python/pyspark/java_gateway.py", line 105, in launch_gateway
raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
>>> quit()
root@hpcc:~#
I am using Ubuntu 18.04.5 LTS.
The JDK version is 11.0.11.
The Scala version is 2.12.13.
The spark-shell version is 3.0.2.
What is the problem, and how can I get past it?
Comments:
Maybe you can download a .jar from spark-packages.org/package/graphframes/graphframes to your local repository? Or use --repositories to add https://spark-packages.org/ to the Ivy URLs.
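The --repositories suggestion in the comment can be sketched as follows. This is only an illustration that assembles the command line; it does not launch Spark. The package coordinate is the one from the question, and the repository URL is the one named in the answer (repos.spark-packages.org), used here as an assumption in place of the address given in the comment. The helper name `pyspark_command` is hypothetical, and the resulting command has not been verified against the asker's environment.

```python
import shlex

# Coordinate from the question; repository URL as named in the answer (assumption).
PACKAGE = "graphframes:graphframes:0.8.1-spark3.0-s_2.12"
SPARK_PACKAGES_REPO = "https://repos.spark-packages.org"


def pyspark_command(package, repository):
    """Hypothetical helper: argument list for pyspark with one extra remote repository."""
    # --packages takes Maven coordinates; --repositories adds remote
    # repositories that are searched for those coordinates.
    return ["pyspark", "--packages", package, "--repositories", repository]


command = pyspark_command(PACKAGE, SPARK_PACKAGES_REPO)
# Quote each argument so the line can be pasted into a shell safely.
command_line = " ".join(shlex.quote(arg) for arg in command)
print(command_line)
```

The printed line can be run in a terminal in place of the original pyspark invocation; whether the dependency then resolves depends on the repository actually hosting that version.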
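Answer 1:
The jar must be downloaded from repos.spark-packages.org. Unfortunately, when the --packages parameter is used, pyspark does not check this repo. If a working Maven installation is available on your machine, the easiest way to solve the problem is to manually download the jar into your local Maven repository: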
mvn org.apache.maven.plugins:maven-dependency-plugin:2.1:get \
-Dartifact=graphframes:graphframes:0.8.1-spark3.0-s_2.12 \
-DrepoUrl=https://repos.spark-packages.org
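This command downloads the jar (and all required dependencies, if any) into your local Maven repository, /root/.m2/repository. pyspark can pick up the jar from this location: the log in the question shows that local-m2-cache is the first place the resolver looks.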
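To check that the download landed where the resolver expects it, the file path can be derived from the coordinate. The sketch below follows the layout visible in the local-m2-cache lines of the log (group, artifact, version, then artifact-version.jar). The function name `local_m2_jar_path` is hypothetical, and the default repository root is the /root/.m2/repository path from the question; adjust it for a different user.

```python
import os
from pathlib import PurePosixPath


def local_m2_jar_path(coordinate, repo_root="/root/.m2/repository"):
    """Hypothetical helper: where a group:artifact:version jar sits in a local Maven repository."""
    group, artifact, version = coordinate.split(":")
    # Dots in the group id become directory separators in the Maven layout.
    parts = group.split(".") + [artifact, version, f"{artifact}-{version}.jar"]
    return str(PurePosixPath(repo_root, *parts))


jar_path = local_m2_jar_path("graphframes:graphframes:0.8.1-spark3.0-s_2.12")
print(jar_path)
# True only once the mvn command has downloaded the jar on this machine.
print(os.path.exists(jar_path))
```

If the second line prints False after running the mvn command, the jar was probably stored under a different user's home directory than the one pyspark runs as.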