Setting up Python 3.5 + Jupyter with SparkR, Scala, and PySpark in an Anaconda environment
Posted by ljtyxl
Multi-user JupyterHub + Kubernetes authentication: https://my.oschina.net/u/2306127/blog/1837196
https://ojerk.cn/Ubuntu%E4%B8%8B%E5%A4%9A%E7%94%A8%E6%88%B7%E7%89%88jupyterhub%E9%83%A8%E7%BD%B2/
Ubuntu 16.04
curl -O https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
Python 3.5.2
We can now verify the data integrity of the installer with cryptographic hash verification through the SHA-256 checksum. We'll use the sha256sum command along with the filename of the script:
sha256sum Anaconda3-4.2.0-Linux-x86_64.sh
You'll receive output that looks similar to this:
73b51715a12b6382dd4df3dd1905b531bd6792d4aa7273b2377a0436d45f0e78 Anaconda3-4.2.0-Linux-x86_64.sh
Check the output against the hashes listed on the "Anaconda with Python 3 on 64-bit Linux" page for your Anaconda version. As long as your output matches the hash shown in the SHA-256 column, you're good to go.
Now we can run the script:
bash Anaconda3-4.2.0-Linux-x86_64.sh
Anaconda3 will now be installed into this location:
/opt/anaconda3
[/opt/anaconda3] >>>
export PATH=/opt/anaconda3/bin:$PATH    # add Anaconda to PATH (e.g. in /etc/profile or ~/.bashrc)
conda create -n jupyter_py352_env python=3.5.2
Installing JupyterHub in single-server, multi-user mode
First make sure Python 3 or later is installed.
source activate jupyter_py352_env
Run the following commands:
sudo apt-get install gcc
sudo apt-get install openssl
sudo apt-get install libssl-dev
On CentOS:
sudo yum install openssl-devel
# Install Python 3 pip (JupyterHub depends on Python 3)
$ sudo apt-get install python3-pip
# Install npm / nodejs-legacy
sudo apt-get install npm nodejs-legacy
1.2 Installing nodejs
Install nodejs and npm:
apt install nodejs-legacy
apt install npm
Upgrade (recommended: the versions installed by apt are quite old and cause many errors):
npm install -g n     # install the n version manager
n stable             # update nodejs to the latest stable release
npm install -g npm   # update npm itself
# Install the hub and the configurable HTTP proxy
conda install jupyterhub
npm install -g configurable-http-proxy
# needed if running the notebook servers locally
pip install jupyter-conda
conda install notebook
Check that the installation succeeded:
jupyterhub -h
configurable-http-proxy -h
jupyterhub --no-ssl
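Once the hub is running you can also poke it programmatically; the hub exposes an unauthenticated version endpoint, so a minimal check from Python (assuming the hub listens on 127.0.0.1:8000, as started above) looks like this:
# Minimal sanity check against a hub started with `jupyterhub --no-ssl`.
# /hub/api returns the hub version without authentication.
import requests
resp = requests.get("http://127.0.0.1:8000/hub/api")
print(resp.status_code, resp.json())   # e.g. 200 {'version': '0.9.x'}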
Add users that can log in.
Create a user and group:
- groupadd jupyter_usergroup
- sudo useradd -c "jupyter user test" -g jupyter_usergroup -d /home/jupyter_user2 jupyter_user2 -m
- ls /home/
Or simply:
useradd jupyter_user2
passwd jupyter_user2        # then enter the new password (e.g. 123456) when prompted
Example: set a user's group (note that usermod takes the group before the user):
usermod -g jupyter_usergroup jupyter_user2
- List all users in the jupyterhub group (a Python alternative is sketched just below):
GID=`grep 'jupyterhub' /etc/group | awk -F':' '{print $3}'`
awk -F':' '{print $1"\t"$4}' /etc/passwd | grep $GID
- passwd <username>    # change a user's password (run as root)
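The same check can also be done from Python's standard library, which avoids quoting the awk program; a small sketch that reads the same group and passwd databases as the shell pipeline above:
# List the jupyterhub group's GID, its supplementary members, and the users
# whose primary group it is (same information as the awk pipeline above).
import grp, pwd
g = grp.getgrnam('jupyterhub')
primary_users = [u.pw_name for u in pwd.getpwall() if u.pw_gid == g.gr_gid]
print('gid:', g.gr_gid)
print('supplementary members:', g.gr_mem)
print('primary-group users:', primary_users)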
Run the following command to generate the config file:
jupyterhub --generate-config
Edit the config file:
## The number of threads to allocate for encryption
#c.CryptKeeper.n_threads =8
c.JupyterHub.ip = '0.0.0.0'
c.JupyterHub.port = 8000
c.PAMAuthenticator.encoding = 'utf-8'
c.LocalAuthenticator.create_system_users = True
c.LocalAuthenticator.group_whitelist = {'jupyterhub'}
c.Authenticator.whitelist = {'ubuntu', 'jupyter_user1', 'jupyter_user2', 'test01'}
c.JupyterHub.admin_users = {'ubuntu'}
c.JupyterHub.statsd_prefix = 'jupyterhub'
This adds the whitelist and the admin users.
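Two spawner settings are often added in the same jupyterhub_config.py; a hedged sketch with illustrative values that are not part of the original setup:
# Optional spawner settings (illustrative values, not from the original config).
c.Spawner.notebook_dir = '~/notebooks'   # directory each single-user server starts in
c.Spawner.default_url = '/lab'           # open JupyterLab instead of the classic UI (requires JupyterLab in the user env)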
Log in at serverip:8000.
Install JupyterLab
Reference: 云服务器搭建神器JupyterLab(多图)_D介子的博客-CSDN博客
(ljt_env) ubuntu@node1:~/ljt_test$ python
Python 3.5.4 |Anaconda, Inc.| (default, Feb 19 2018, 10:59:04)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from notebook.auth import passwd
>>> passwd()
Enter password:
Verify password:
'sha1:25f46ecf43f0:8f778092033e870fec6718189eaeba118aec807a'
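The same helper also accepts the passphrase as an argument, which is handy when generating the hash non-interactively; a minimal sketch:
# Non-interactive variant of the session above; notebook.auth.passwd()
# returns a salted hash you can paste into jupyter_notebook_config.py.
from notebook.auth import passwd
print(passwd('choose-a-strong-password'))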
vim /home/ubuntu/.jupyter/jupyter_notebook_config.py
Append the following at the end of the file:
c.NotebookApp.allow_root = True
c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.notebook_dir = u'/home/ubuntu/ljt_test/jupyterhubHome'
c.NotebookApp.open_browser = False
c.NotebookApp.password = u'sha1:25f46ecf43f0:8f778092033e870fec6718189eaeba118aec807a'
c.NotebookApp.port = 8000
(ljt_env) ubuntu@node1:~/ljt_test$ jupyter-lab --version
0.34.9
Start JupyterHub
1. Activate the Python 3.5 virtual environment:
cd /home/ubuntu/ljt_test
source activate ljt_env
2. Check the JupyterHub service:
ps -ef|grep jupyterhub
lsof -i:8000
kill -9 45654    # example: kill a stale jupyterhub process by its PID
tail -fn200 /home/ubuntu/ljt_test/jupyterhub.log
3. If the JupyterHub service is not running, start it:
nohup jupyterhub --no-ssl > jupyterhub.log &
nohup jupyterhub -f /etc/jupyterhub_config.py --no-ssl > jupyterhub.log &
nohup jupyterhub -f /home/ubuntu/ljt_test/jupyterhub_config.py --ssl-key /home/ubuntu/ljt_test/mykey.key --ssl-cert /home/ubuntu/ljt_test/mycert.pem > jupyterhub.log &
4. Test access
Visit the server at IP:port to test.
5. User management
Users on the whitelist are added automatically but have no password; a password must be set before they can log in.
Add a new user: sudo useradd jupyter_user2 -d /home/ubuntu/jupyter_user2 -m
Add the user to the group: sudo adduser jupyter_user2 jupyterhub
Change a user's password: echo crxis:crxis | chpasswd
Issue:
PAM Authentication failed (test01@222.180.208.234): [PAM Error 7] Authentication failure
Authenticator overview:
PAMAuthenticator  | default, built-in authenticator
OAuthenticator    | OAuth + JupyterHub authenticator = OAuthenticator
LdapAuthenticator | simple LDAP authenticator plugin for JupyterHub
kdcAuthenticator  | Kerberos authenticator plugin for JupyterHub
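For example, switching from PAM to the LDAP plugin is done in jupyterhub_config.py; a hedged sketch, assuming ldapauthenticator is pip-installed and using placeholder server and DN values:
# Hedged example: enable LDAPAuthenticator (server and DN template are placeholders).
c.JupyterHub.authenticator_class = 'ldapauthenticator.LDAPAuthenticator'
c.LDAPAuthenticator.server_address = 'ldap.example.com'
c.LDAPAuthenticator.bind_dn_template = 'uid={username},ou=people,dc=example,dc=com'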
https://blog.chairco.me/posts/2018/06/how%20to%20build%20a%20jupytre-hub%20for%20team.html
Note: on my server the login user is the administrator by default, so all of the steps below were performed as root. If the current user is not an administrator, you may run into all sorts of problems (my guess is that they are authentication-related; the blog post above documents the PAM authentication setup in detail, so it is worth trying).
Quickstart — JupyterHub documentation
如何在非安全的CDH集群中部署多用户JupyterHub服务并集成Spark2 - 腾讯云开发者社区-腾讯云
JupyterHub与OpenLDAP集成 - 腾讯云开发者社区-腾讯云
- jupyterhub -f ./jupyterhub_config.py --ssl-key ./mykey.key --ssl-cert ./mycert.pem
References:
How To Install the Anaconda Python Distribution on Ubuntu 16.04 | DigitalOcean
Installation of Jupyterhub on remote server · jupyterhub/jupyterhub Wiki · GitHub
JupyterHub的安装与配置——让Jupyter支持多用户 - crxis - 博客园
[远程使用Jupyter]本地使用服务器端运行的Jupyter Notebook_Papageno2018的博客-CSDN博客
云服务器搭建神器JupyterLab(多图)_D介子的博客-CSDN博客
记一次在服务器上配置 Jupyterhub 作为系统服务
https://blog.huzicheng.com/2018/01/04/jupyter-as-a-service/
Configure JupyterHub
Make PySpark work inside JupyterHub
conda install -c conda-forge pyspark==2.2.0
vim /etc/profile
export JAVA_HOME=/usr/java/jdk1.8.0_181
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=.:$JAVA_HOME/lib
export PATH=$PATH:$HADOOP_HOME/bin
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
source /etc/profile
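After sourcing the profile, a quick way to confirm that PySpark itself is importable (independently of YARN) is to spin up a local SparkSession from any notebook or Python shell; a minimal sketch:
# Local PySpark smoke test; does not touch the cluster.
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .master("local[2]")
         .appName("pyspark-smoke-test")
         .getOrCreate())
print(spark.version)                                     # e.g. 2.2.0
spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"]).show()
spark.stop()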
How to integrate JupyterHub with the existing Cloudera cluster
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
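An alternative to the PYSPARK_DRIVER_PYTHON exports is to leave the driver environment untouched and bootstrap Spark from inside an ordinary Python kernel with the findspark package (pip install findspark); a hedged sketch, assuming SPARK_HOME points at the Cloudera parcel as above:
# Hedged alternative: initialise Spark from a plain Python notebook kernel.
import findspark
findspark.init("/opt/cloudera/parcels/CDH/lib/spark")  # or findspark.init() if SPARK_HOME is already exported
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("yarn").appName("findspark-demo").getOrCreate()
print(spark.sparkContext.master)   # 'yarn' if HADOOP_CONF_DIR is configured
spark.stop()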
vim /home/ubuntu/anaconda3/share/jupyter/kernels/python3/kernel.json
{
  "argv": ["/home/ljt/anaconda3/bin/python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
  "display_name": "Python 3.6+Pyspark2.4.3",
  "language": "python",
  "env": {
    "HADOOP_CONF_DIR": "/mnt/e/hadoop/3.1.1/conf",
    "PYSPARK_PYTHON": "/home/ljt/anaconda3/bin/python",
    "SPARK_HOME": "/mnt/f/spark/2.4.3",
    "WRAPPED_SPARK_HOME": "/etc/spark",
    "PYTHONPATH": "/mnt/f/spark/2.4.3/python/:/mnt/f/spark/2.4.3/python/lib/py4j-0.10.7-src.zip",
    "PYTHONSTARTUP": "/mnt/f/spark/2.4.3/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master yarn --deploy-mode client pyspark-shell"
  }
}
jupyterhub -f jupyterhub_config.py
nohup jupyterhub -f /home/ubuntu/ljt_test/jupyterhub_config.py --no-ssl > jupyterhub.log &
nohup jupyter lab --config=/home/ubuntu/.jupyter/jupyter_notebook_config.py > jupyter.log &
Scala
https://medium.com/@bogdan.cojocar/how-to-run-scala-and-spark-in-the-jupyter-notebook-328a80090b3b
Step 1: install the package
pip install spylon-kernel
Step 2: create a kernel spec
This will allow us to select the Scala kernel in the notebook.
python -m spylon_kernel install
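To confirm that the spec was registered, the installed kernels can be listed programmatically (equivalent to running `jupyter kernelspec list`); a small sketch:
# Lists installed kernel specs; 'spylon_kernel' should appear after the install step.
from jupyter_client.kernelspec import KernelSpecManager
for name, path in sorted(KernelSpecManager().find_kernel_specs().items()):
    print(name, '->', path)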
Step 3: start the Jupyter notebook
jupyter notebook
In the notebook, select New -> spylon-kernel. This will start our Scala kernel.
Step 4: test the notebook
Let's write some Scala code:
val x = 2
val y = 3
x+y
R kernel
conda install -c r r-irkernel
conda install -c r r-essentials
conda create -n my-r-env -c r r-essentials
GitHub - sparklyr/sparklyr: R interface for Apache Spark
SparkR (R on Spark) - Spark 2.2.0 Documentation
export LD_LIBRARY_PATH="/usr/java/jdk1.8.0_181/jre/lib/amd64/server"
rJava
Installing RJava (Ubuntu) · hannarud/r-best-practices Wiki · GitHub
The Java installation needs to be under /usr/lib/jvm.
SparkR on YARN installation and configuration
conda create -p /home/ubuntu/anaconda3/envs/r_env --copy -y -q r-essentials -c r
sparkR --executor-memory 2g --total-executor-cores 10 --master spark://node1.sc.com:7077
Sys.setenv(SPARK_HOME='/opt/cloudera/parcels/CDH/lib/spark')
.libPaths(c(file.path(Sys.getenv('SPARK_HOME'), 'R', 'lib'), .libPaths()))
library(SparkR)
sc <- sparkR.init(master='yarn-client', sparkPackages="com.databricks:spark-csv_2.11:1.5.0")
sqlContext <- sparkRSQL.init(sc)
df <- read.df(sqlContext, "cars.csv", source = "com.databricks.spark.csv", inferSchema = "true", header="true")
head(df)
configurable-http-proxy --ip **.*.6.110 --port 8000 --log-level=debug
openssl rand -hex 32
0b042f8d651fb8126537d1ec98507b093653d1ffe4b909f053616062184b1db3
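The random hex string is typically wired back into jupyterhub_config.py as the proxy auth token (and a separate one as the cookie secret); a hedged sketch using the trait names of this JupyterHub era, with the file path as a placeholder:
# Hedged sketch: feeding the generated secrets to JupyterHub (file path is a placeholder).
c.JupyterHub.proxy_auth_token = '0b042f8d651fb8126537d1ec98507b093653d1ffe4b909f053616062184b1db3'
c.JupyterHub.cookie_secret_file = '/home/ubuntu/ljt_test/jupyterhub_cookie_secret'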
sparklyr:
sparklyr - Configuring Spark Connections
sparklyr: a test drive on YARN | R-bloggers
Ubuntu 下安装sparklyr 并连接远程spark集群_The_One_is_all的博客-CSDN博客
# .R script showing capabilities of sparklyr R package
# Prerequisites before running this R script:
# Ubuntu 16.04.3 LTS 64-bit, r-base (version 3.4.1 or newer),
# RStudio 64-bit version, libssl-dev, libcurl4-openssl-dev, libxml2-dev
install.packages("httr")
install.packages("xml2")
# New features in sparklyr 0.6:
# https://blog.rstudio.com/2017/07/31/sparklyr-0-6/
install.packages("sparklyr")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("tidyr")
library(sparklyr)
library(dplyr)
library(ggplot2)
library(tidyr)
set.seed(100)
# sparklyr cheat sheet: https://github.com/rstudio/cheatsheets/raw/master/source/pdfs/sparklyr.pdf
# dplyr+tidyr: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
# sparklyr currently (2017-08-19) only supports Apache Spark version 2.2.0 or older
# Install Spark locally:
sc_version <- "2.2.0"
spark_install(sc_version)
config <- spark_config()
# number of CPU cores to use:
config$spark.executor.cores <- 6
# amount of RAM to use for Apache Spark executors:
config$spark.executor.memory <- "4G"
# Connect to local version:
sc <- spark_connect(master = "local",
                    config = config, version = sc_version)
# Copy data to Spark memory:
import_iris <- sdf_copy_to(sc, iris, "spark_iris", overwrite = TRUE)
# partition data:
partition_iris <- sdf_partition(import_iris,training=0.5, testing=0.5)
# Create a hive metadata for each partition:
sdf_register(partition_iris,c("spark_iris_training","spark_iris_test"))
# Create reference to training data in Spark table
tidy_iris <- tbl(sc,"spark_iris_training") %>% select(Species, Petal_Length, Petal_Width)
# Spark ML Decision Tree Model
model_iris <- tidy_iris %>% ml_decision_tree(response="Species", features=c("Petal_Length","Petal_Width"))
# Create reference to test data in Spark table
test_iris <- tbl(sc,"spark_iris_test")
# Bring predictions data back into R memory for plotting:
pred_iris <- sdf_predict(model_iris, test_iris) %>% collect
pred_iris %>%
inner_join(data.frame(prediction=0:2,
lab=model_iris$model.parameters$labels)) %>%
ggplot(aes(Petal_Length, Petal_Width, col=lab)) +
geom_point()
spark_disconnect(sc)