PySpark SparkContext NameError 'sc' in Jupyter

Posted 2023-04-15

Asked: 2016-04-22 17:07:20

Question:

I am new to pyspark and want to use it through an IPython notebook on my Ubuntu 12.04 machine. Below is my configuration for pyspark and the IPython notebook.

sparkuser@Ideapad:~$ echo $JAVA_HOME
/usr/lib/jvm/java-8-oracle

# Path for Spark
sparkuser@Ideapad:~$ ls /home/sparkuser/spark/
bin    CHANGES.txt  data  examples  LICENSE   NOTICE  R          RELEASE  scala-2.11.6.deb
build  conf         ec2   lib       licenses  python  README.md  sbin     spark-1.5.2-bin-hadoop2.6.tgz

I installed Anaconda2 4.0.0; the Anaconda path is:

sparkuser@Ideapad:~$ ls anaconda2/
bin  conda-meta  envs  etc  Examples  imports  include  lib  LICENSE.txt  mkspecs  pkgs  plugins  share  ssl  tests

Create a PySpark profile for IPython:

ipython profile create pyspark

sparkuser@Ideapad:~$ cat .bashrc

export SPARK_HOME="$HOME/spark"
export PYSPARK_SUBMIT_ARGS="--master local[2]"
# added by Anaconda2 4.0.0 installer
export PATH="/home/sparkuser/anaconda2/bin:$PATH"

Create a file named ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py:

sparkuser@Ideapad:~$ cat .ipython/profile_pyspark/startup/00-pyspark-setup.py 
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

filename = os.path.join(spark_home, 'python/pyspark/shell.py')
exec(compile(open(filename, "rb").read(), filename, 'exec'))

spark_release_file = spark_home + "/RELEASE"

if os.path.exists(spark_release_file) and "Spark 1.5.2" in open(spark_release_file).read():
    pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
    if not "pyspark-shell" in pyspark_submit_args: 
        pyspark_submit_args += " pyspark-shell"
        os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

Starting the pyspark shell in a terminal:

sparkuser@Ideapad:~$ ~/spark/bin/pyspark
Python 2.7.11 |Anaconda 4.0.0 (64-bit)| (default, Dec  6 2015, 18:08:32) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/04/22 21:06:55 INFO SparkContext: Running Spark version 1.5.2
16/04/22 21:07:27 INFO BlockManagerMaster: Registered BlockManager
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Python version 2.7.11 (default, Dec  6 2015 18:08:32)
SparkContext available as sc, HiveContext available as sqlContext.
>>> sc
<pyspark.context.SparkContext object at 0x7facb75b50d0>
>>>

When I run the following command, a Jupyter browser window opens:

sparkuser@Ideapad:~$ ipython notebook --profile=pyspark
[TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated and will be removed in future versions.
[TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... continue in 5 sec. Press Ctrl-C to quit now.
[W 21:32:08.070 NotebookApp] Unrecognized alias: '--profile=pyspark', it will probably have no effect.
[I 21:32:08.111 NotebookApp] Serving notebooks from local directory: /home/sparkuser
[I 21:32:08.111 NotebookApp] 0 active kernels 
[I 21:32:08.111 NotebookApp] The Jupyter Notebook is running at: http://localhost:8888/
[I 21:32:08.111 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Created new window in existing browser session.

If I enter the following command in the notebook, it throws a NameError:

In [ ]: print sc
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-2-ee8101b8fe58> in <module>()
----> 1 print sc
NameError: name 'sc' is not defined

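When I run the command above in the pyspark terminal, it prints the expected output, but when I run the same command in Jupyter, it throws the error shown above.

Those are my configuration settings for pyspark and IPython. How do I configure pyspark with Jupyter?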

Answer 1:

Here is a workaround: I suggest you try not relying on pyspark to load the context for you.

Install the findspark Python package:

pip install findspark

If you installed Jupyter Notebook with Anaconda, use the Anaconda prompt or a terminal instead:

 $CONDA_PYTHON_EXE -m pip install findspark

Then simply import findspark and initialize the SparkContext:

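import findspark
findspark.init()  # locates the Spark install (e.g. via SPARK_HOME) and adds it to sys.path

import pyspark  # import pyspark only after findspark.init()

# findspark does not create a context, so create one explicitly
# ("test" is just an example application name)
sc = pyspark.SparkContext(appName="test")
print(sc)

# On Spark 2.0+ a SparkSession can be created as well:
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.getOrCreate()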

Reference: https://pypi.python.org/pypi/findspark
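
As a rough illustration of what findspark.init() automates, the sketch below locates the two entries that must be on sys.path before pyspark can be imported: Spark's python directory and the bundled py4j zip. The helper names spark_python_paths and add_spark_to_sys_path are hypothetical, and the glob pattern assumes the usual layout of a binary Spark distribution (python/lib/py4j-*-src.zip). It is not findspark's actual implementation.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the sys.path entries needed to import pyspark from spark_home.

    Hypothetical helper: assumes the layout <spark_home>/python and
    <spark_home>/python/lib/py4j-*-src.zip.
    """
    python_dir = os.path.join(spark_home, "python")
    # Glob for the py4j zip so its version number is not hard-coded.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    if not os.path.isdir(python_dir) or not py4j_zips:
        raise ValueError("no Spark Python libraries found under %r" % spark_home)
    return [python_dir, py4j_zips[-1]]


def add_spark_to_sys_path(spark_home=None):
    """Prepend the Spark Python paths to sys.path, defaulting to SPARK_HOME."""
    spark_home = spark_home or os.environ.get("SPARK_HOME")
    if not spark_home:
        raise ValueError("SPARK_HOME is not set")
    paths = spark_python_paths(spark_home)
    # Insert in reverse so the python directory ends up first on sys.path.
    for path in reversed(paths):
        if path not in sys.path:
            sys.path.insert(0, path)
    return paths
```

Globbing for the py4j archive avoids tying a startup script to one Spark release, since the py4j version in the file name changes between Spark versions.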

Answer 2:

Hi, you need to set up a pyspark kernel. In a terminal, run:

mkdir -p ~/.ipython/kernels/pyspark

nano ~/.ipython/kernels/pyspark/kernel.json

Then paste in the following text (replace /usr/bin/python with the path to your Python interpreter):

{
  "display_name": "pySpark (Spark 1.6.1)",
  "language": "python",
  "argv": [
    "/usr/bin/python",
    "-m", "IPython.kernel",
    "--profile=pyspark",
    "-f",
    "{connection_file}"
  ]
}

Then save (Ctrl+X, then y).

You should now have "pyspark" among your Jupyter kernels.
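
A kernel.json file has to be strict JSON (double-quoted strings, an enclosing object, no comments), so one way to avoid syntax slips is to generate it with Python's json module. The following is a minimal sketch under that assumption; the function names build_kernel_spec and write_kernel_spec and their default arguments are hypothetical, chosen only for illustration.

```python
import json
import os


def build_kernel_spec(python_path, profile="pyspark", display_name="pySpark"):
    """Build a kernel spec dict; the names and defaults are illustrative."""
    return {
        "display_name": display_name,
        "language": "python",
        "argv": [
            python_path,
            "-m", "IPython.kernel",
            "--profile=" + profile,
            # Literal placeholder that Jupyter substitutes at kernel launch.
            "-f", "{connection_file}",
        ],
    }


def write_kernel_spec(kernel_dir, spec):
    """Write spec as kernel.json inside kernel_dir, creating it if needed."""
    if not os.path.isdir(kernel_dir):
        os.makedirs(kernel_dir)
    path = os.path.join(kernel_dir, "kernel.json")
    with open(path, "w") as handle:
        json.dump(spec, handle, indent=2)
    return path
```

For example, write_kernel_spec(os.path.expanduser("~/.ipython/kernels/pyspark"), build_kernel_spec("/usr/bin/python")) would write a spec to the directory used in this answer.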

Now either sc already exists in your notebook (try evaluating sc in a cell), or, if it does not, try running these lines:

import pyspark
conf = (pyspark.SparkConf().setAppName('test').set("spark.executor.memory", "2g").setMaster("local[2]"))
sc = pyspark.SparkContext(conf=conf)

You should now have your sc running.

Answer 3:

My simple advice is not to overcomplicate the pyspark installation.

For versions > 2.2, you can install the pyspark package with a simple pip install pyspark. If you also want Jupyter, run another pip install for it:

pip install pyspark
pip install jupyter

Alternatively, if you want to use a different version or a specific distribution of Spark, the earlier 3 minute method would be: https://blog.sicara.com/get-started-pyspark-jupyter-guide-tutorial-ae2fe84f594f
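
After either install route, a quick check from inside a notebook cell can confirm that the kernel's interpreter actually sees the packages, which matters when pip and Jupyter belong to different Python installations. This is a minimal sketch that assumes Python 3. The helper name is_importable is hypothetical, and jupyter_core is checked on the assumption that it is pulled in as a dependency of jupyter.

```python
import importlib.util


def is_importable(module_name):
    """Return True if the current interpreter can locate module_name."""
    return importlib.util.find_spec(module_name) is not None


# jupyter_core is assumed to be installed as a dependency of jupyter.
for name in ("pyspark", "jupyter_core"):
    status = "found" if is_importable(name) else "missing"
    print("%s: %s" % (name, status))
```

If pyspark is reported as missing here but imports fine in a terminal, the notebook kernel is likely running a different interpreter than the one pip installed into.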
