Error in pyspark with udf: You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly
Posted: 2016-07-18 17:00:11
【Question】: I am trying to run the code below in PySpark 1.6.2, using the package pre-built for Hadoop 2.6, on Windows 7 Professional.
Everything works fine until I define the udf. Could someone point me in the right direction? Do I need to build Spark with Hive myself? If so, what is the point of the package pre-built for Hadoop 2.6? I cannot change the permissions on C:\tmp\hive because I am not a system administrator. Could that be the cause of this error message?
from pyspark import SparkContext
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
rdd = sc.parallelize([('u1', 1, [1 ,2, 3]), ('u1', 4, [1, 2, 3])])
df = rdd.toDF(['user', 'item', 'fav_items'])
# Print dataFrame
df.show()
from pyspark.sql.functions import *
from pyspark.sql.types import IntegerType
function = udf(lambda item, items: 1 if item in items else 0, IntegerType())
df.select('user', 'item', 'fav_items', function(col('item'), col('fav_items')).alias('result')).show()
Then I get this error:
You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\yrxt028\Downloads\spark-1.6.2-bin-hadoop2.6\spark-1.6.2-bin-hadoop2.6\python\pyspark\sql\functions.py", line 1597, in udf
return UserDefinedFunction(f, returnType)
File "C:\Users\yrxt028\Downloads\spark-1.6.2-bin-hadoop2.6\spark-1.6.2-bin-hadoop2.6\python\pyspark\sql\functions.py", line 1558, in __init__
self._judf = self._create_judf(name)
File "C:\Users\yrxt028\Downloads\spark-1.6.2-bin-hadoop2.6\spark-1.6.2-bin-hadoop2.6\python\pyspark\sql\functions.py", line 1569, in _create_judf
jdt = ctx._ssql_ctx.parseDataType(self.returnType.json())
File "C:\Users\yrxt028\Downloads\spark-1.6.2-bin-hadoop2.6\spark-1.6.2-bin-hadoop2.6\python\pyspark\sql\context.py", line 683, in _ssql_ctx
self._scala_HiveContext = self._get_hive_ctx()
File "C:\Users\yrxt028\Downloads\spark-1.6.2-bin-hadoop2.6\spark-1.6.2-bin-hadoop2.6\python\pyspark\sql\context.py", line 692, in _get_hive_ctx
return self._jvm.HiveContext(self._jsc.sc())
File "C:\Users\yrxt028\Downloads\spark-1.6.2-bin-hadoop2.6\spark-1.6.2-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py", line 1064,
l__
File "C:\Users\yrxt028\Downloads\spark-1.6.2-bin-hadoop2.6\spark-1.6.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 45, in deco
return f(*a, **kw)
File "C:\Users\yrxt028\Downloads\spark-1.6.2-bin-hadoop2.6\spark-1.6.2-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\protocol.py", line 308, in g
_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx------
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:204)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)
at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
at org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:462)
at org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:461)
at org.apache.spark.sql.UDFRegistration.<init>(UDFRegistration.scala:40)
at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:330)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:90)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:214)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx------
at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:612)
at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
... 21 more
【Comments】:
Did you install hive separately? Or did you just download spark and use the default (hive) configuration?
No separate hive install. I am using the package pre-built for hadoop 2.6. I have been using pyspark on my mac; this is the first time I am working with udfs on Windows.
【Answer 1】: You need to create a hive-site.xml file under $SPARK_HOME/conf. In that file you can override the scratch dir paths. These are the important settings you should include in hive-site.xml; if you hit other errors, you can check this link for other settings:
<!-- Hive Execution Parameters -->
<property>
<name>hadoop.tmp.dir</name>
<value>${test.tmp.dir}/hadoop-tmp</value>
<description>A base for other temporary directories.</description>
</property>
<!--
<property>
<name>hive.exec.reducers.max</name>
<value>1</value>
<description>maximum number of reducers</description>
</property>
-->
<property>
<name>hive.exec.scratchdir</name>
<value>${test.tmp.dir}/scratchdir</value>
<description>Scratch space for Hive jobs</description>
</property>
<property>
<name>hive.exec.local.scratchdir</name>
<value>${test.tmp.dir}/localscratchdir/</value>
<description>Local scratch space for Hive jobs</description>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=${test.tmp.dir}/junit_metastore_db;create=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>
<property>
<!-- this should eventually be deprecated since the metastore should supply this -->
<name>hive.metastore.warehouse.dir</name>
<value>${test.warehouse.dir}</value>
<description></description>
</property>
<property>
<name>hive.metastore.metadb.dir</name>
<value>file://${test.tmp.dir}/metadb/</value>
<description>
Required by metastore server or if the uris argument below is not supplied
</description>
</property>
<property>
<name>test.log.dir</name>
<value>${test.tmp.dir}/log/</value>
<description></description>
</property>
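To confirm the override actually takes effect, one rough sanity check (a sketch, not part of the original answer; it assumes a freshly restarted pyspark 1.6 shell, where sc is already defined, and hive-site.xml placed under %SPARK_HOME%\conf) is to force the Hive-backed context to start, since that is the step that failed before:
from pyspark.sql import HiveContext
hc = HiveContext(sc)  # sc is provided by the pyspark shell
# The first statement that touches Hive starts the underlying Hive session;
# previously this was the point that threw "The root scratch dir: /tmp/hive
# on HDFS should be writable".
hc.sql("SHOW TABLES").show()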
【Answer 2】: In my hive-site.xml I only have the following:
<configuration>
<property>
<name>hive.exec.scratchdir</name>
<value>/tmp1/hive/scratchdir</value>
<description>Scratch space for Hive jobs</description>
</property>
<property>
<name>hive.exec.local.scratchdir</name>
<value>/tmp1/hive/localscratchdir/</value>
<description>Local scratch space for Hive jobs</description>
</property>
</configuration>
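With that file under %SPARK_HOME%\conf and the pyspark shell restarted, the snippet from the question can be re-run unchanged; the sketch below simply repeats it with explicit imports (sc again comes from the shell):
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType
sqlContext = SQLContext(sc)
df = sc.parallelize([('u1', 1, [1, 2, 3]), ('u1', 4, [1, 2, 3])]) \
       .toDF(['user', 'item', 'fav_items'])
# Defining the UDF is what used to fail: udf() lazily creates the shell's
# Hive-backed context in order to parse the return type (see the traceback).
function = udf(lambda item, items: 1 if item in items else 0, IntegerType())
df.select('user', 'item', 'fav_items',
          function(col('item'), col('fav_items')).alias('result')).show()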
This post also helped me:
What's hive-site.xml including in $SPARK_HOME looks like?
So instead of /tmp/hive, whose permissions I could not change, I created another folder, /tmp1/hive. Many thanks @BigDataLearnner. There is also another way to override the directories, using spark-env.sh; I tried it, but it did not work.
The spark-env.sh contents:
SPARK_LOCAL_DIRS=/tmp1/hive
SPARK_PID_DIR=/tmp1
【Comments】:
This did not work for me, but @Ronak Patel's extended version did.