Integrate PySpark with Jupyter Notebook


Posted: 2016-08-25 15:43:37

Question:

I am following this site to install Jupyter Notebook and PySpark and to integrate the two.

When I got to the step of creating a "Jupyter profile", I read that "Jupyter profiles" no longer exist, so I proceeded with the following lines instead:

$ mkdir -p ~/.ipython/kernels/pyspark

$ touch ~/.ipython/kernels/pyspark/kernel.json

I opened kernel.json and wrote the following:


 "display_name": "pySpark",
 "language": "python",
 "argv": [
  "/usr/bin/python",
  "-m",
  "IPython.kernel",
  "-f",
  "connection_file"
 ],
 "env": 
  "SPARK_HOME": "/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7",
  "PYTHONPATH": "/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7/python:/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip",
  "PYTHONSTARTUP": "/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7/python/pyspark/shell.py",
  "PYSPARK_SUBMIT_ARGS": "pyspark-shell"
 

The path to Spark is correct.
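Since a single missing brace or comma makes a kernel spec unreadable to Jupyter, one way to avoid hand-editing mistakes is to build the spec as a Python dict and serialize it. This is a sketch; the paths are the ones from the spec above and will differ on other machines:

```python
import json

# Build the kernel spec as a dict and serialize it, which guarantees
# valid JSON braces and commas. Paths are taken from the spec above.
kernel_spec = {
    "display_name": "pySpark",
    "language": "python",
    "argv": [
        "/usr/bin/python",
        "-m",
        "IPython.kernel",
        "-f",
        "{connection_file}",  # Jupyter substitutes the real file path here
    ],
    "env": {
        "SPARK_HOME": "/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7",
        "PYTHONPATH": "/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7/python:"
                      "/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip",
        "PYTHONSTARTUP": "/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7/python/pyspark/shell.py",
        "PYSPARK_SUBMIT_ARGS": "pyspark-shell",
    },
}

spec_json = json.dumps(kernel_spec, indent=1)
# Write this string to ~/.ipython/kernels/pyspark/kernel.json
```

Note that `{connection_file}` must appear literally, braces included: it is a placeholder Jupyter replaces at kernel launch.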

However, when I run jupyter console --kernel pyspark, I get this output:

MacBook:~ Agus$ jupyter console --kernel pyspark
/usr/bin/python: No module named IPython
Traceback (most recent call last):
  File "/usr/local/bin/jupyter-console", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/site-packages/jupyter_core/application.py", line 267, in launch_instance
    return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/traitlets/config/application.py", line 595, in launch_instance
    app.initialize(argv)
  File "<decorator-gen-113>", line 2, in initialize
  File "/usr/local/lib/python2.7/site-packages/traitlets/config/application.py", line 74, in catch_config_error
    return method(app, *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/jupyter_console/app.py", line 137, in initialize
    self.init_shell()
  File "/usr/local/lib/python2.7/site-packages/jupyter_console/app.py", line 110, in init_shell
    client=self.kernel_client,
  File "/usr/local/lib/python2.7/site-packages/traitlets/config/configurable.py", line 412, in instance
    inst = cls(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/jupyter_console/ptshell.py", line 251, in __init__
    self.init_kernel_info()
  File "/usr/local/lib/python2.7/site-packages/jupyter_console/ptshell.py", line 305, in init_kernel_info
    raise RuntimeError("Kernel didn't respond to kernel_info_request")
RuntimeError: Kernel didn't respond to kernel_info_request
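The first line of the output is the real clue: the kernel spec's argv points at /usr/bin/python, and that interpreter has no IPython installed, so the kernel process dies before it can answer kernel_info_request. A quick sanity check, using only the standard library, is to ask an interpreter whether it can import IPython at all:

```python
import importlib.util
import sys

def has_module(name: str) -> bool:
    """Return True if the current interpreter can import `name`."""
    return importlib.util.find_spec(name) is not None

# Run this with the same interpreter named in kernel.json's "argv".
# If it prints False for "IPython", the kernel can never start.
print(sys.executable)
print(has_module("IPython"))
```

If /usr/bin/python lacks IPython, either install it for that interpreter (pip install ipython) or change argv to point at an interpreter that already has it.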

Comments:

Answer 1:

There are several ways to integrate PySpark with Jupyter Notebook.

1. Install Apache Toree:

  pip install jupyter
  pip install toree
  jupyter toree install --spark_home=path/to/your/spark_directory --interpreters=PySpark

You can verify the installation with:

 jupyter kernelspec list

You will get an entry for the Toree PySpark kernel:

  apache_toree_pyspark    /home/pauli/.local/share/jupyter/kernels/apache_toree_pyspark

After that, you can also install other interpreters such as Scala, SparkR, and SQL if you wish:

 jupyter toree install --interpreters=Scala,SparkR,SQL

2. Add these lines to your .bashrc:

  export SPARK_HOME=/path/to/spark-2.2.0
  export PATH="$PATH:$SPARK_HOME/bin"    
  export PYSPARK_DRIVER_PYTHON=jupyter
  export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

Type pyspark in a terminal, and it will open a Jupyter Notebook with a SparkContext already initialized.
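The same variables can be assembled programmatically when launching pyspark from another process. A sketch, with a hypothetical SPARK_HOME path:

```python
import os

# Build the environment that the `pyspark` launcher script inspects,
# mirroring the .bashrc exports above. The SPARK_HOME path is hypothetical.
env = dict(os.environ)
env["SPARK_HOME"] = "/opt/spark-2.2.0"
env["PATH"] = env.get("PATH", "") + os.pathsep + os.path.join(env["SPARK_HOME"], "bin")
env["PYSPARK_DRIVER_PYTHON"] = "jupyter"
env["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"

# e.g. subprocess.run(["pyspark"], env=env) would then start Jupyter
# Notebook with a SparkContext already created.
```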

3. Simply install pyspark as a Python package:

    pip install pyspark

Now you can import pyspark just like any other Python package.

Comments:

Answer 2:

The easiest way is to use findspark. First create an environment variable:

export SPARK_HOME="full path to Spark"

Then install findspark:

pip install findspark

Then launch Jupyter Notebook, and the following should work:

import findspark
findspark.init()

import pyspark

Comments:
