How to convert the top X words in a SparseVector into an array of strings with PySpark
我目前正在集中一些文本文档。由于PySpark方法,我正在使用K-means并使用TF-IDF继续我的数据。现在我想获得每个集群的前10个单词:
When I do:
getTopwords_udf = udf(lambda vector: [ countVectorizerModel.vocabulary[indice] for indice in vector.toArray().tolist().argsort()[-10:][::-1]], ArrayType(StringType()))
predictions.groupBy("prediction").agg(Summarizer.mean(col("features")).alias("means")) \
    .withColumn("topWord", getTopwords_udf(col('means'))) \
    .select("prediction", "topWord") \
    .show(2, truncate=100)
I get this error:
Could not serialize object: Py4JError: An error occurred while calling o225.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Traceback (most recent call last):
File "/opt/bigpipe/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 189, in wrapper
return self(*args)
File "/opt/bigpipe/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 167, in __call__
judf = self._judf
File "/opt/bigpipe/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 151, in _judf
self._judf_placeholder = self._create_judf()
File "/opt/bigpipe/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 160, in _create_judf
wrapped_func = _wrap_function(sc, self.func, self.returnType)
File "/opt/bigpipe/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 35, in _wrap_function
pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
File "/opt/bigpipe/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2420, in _prepare_for_python_RDD
pickled_command = ser.dumps(command)
File "/opt/bigpipe/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 597, in dumps
raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: Py4JError: An error occurred while calling o225.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I thought this was because of the types (numpy floats versus Spark's DoubleType), so I also tried the following to see what was going on:
vector_udf = udf(lambda vector: vector.toArray().tolist(), ArrayType(FloatType()))
vector2_udf = udf(lambda vector: vector.sort()[:10], ArrayType(FloatType()))
predictions.groupBy("prediction").agg(Summarizer.mean(col("features")).alias("means")) \
    .withColumn("topWord", vector_udf(col('means'))) \
    .withColumn("topWord2", vector2_udf(col('topWord'))) \
    .select("prediction", "topWord", "topWord2") \
    .show(2, truncate=100)
But this time I get TypeError: 'NoneType' object is not subscriptable, presumably because list.sort() sorts in place and returns None.
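Using sorted() instead, for example something along these lines, should avoid that particular error, but it still only gives me weights rather than words:

# sorted() returns a new list instead of mutating in place and returning None
vector2_udf = udf(lambda arr: sorted(arr, reverse=True)[:10], ArrayType(FloatType()))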
Answer
I figured out how to convert the top X words in a SparseVector into an array of strings with PySpark. Here is the solution, for anyone who might be interested...
def getTopWordContainer(v):
    # Close over the plain vocabulary list v, so the UDF only has to
    # pickle a Python list rather than the CountVectorizerModel itself.
    def getTopWord(vector):
        vectorConverted = vector.toArray().tolist()
        # Indices of the 10 largest mean weights, in descending order
        listSortedDesc = [i[0] for i in sorted(enumerate(vectorConverted), key=lambda x: x[1])][-10:][::-1]
        return [v[j] for j in listSortedDesc]
    return getTopWord
getTopWordInit = getTopWordContainer(countVectorizerModel.vocabulary)
getTopWord_udf = udf(getTopWordInit, ArrayType(StringType()))

top = predictions.groupBy("prediction").agg(Summarizer.mean(col("features")).alias("means")) \
    .withColumn("topWord", getTopWord_udf(col('means'))) \
    .select("prediction", "topWord")
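The key difference from the failing version is that the UDF now only closes over countVectorizerModel.vocabulary, a plain Python list that pickles fine, instead of the model object itself, which is backed by a JVM object and is what triggers the __getstate__ error. A slightly more general, untested sketch with the number of words as a parameter (the names makeTopWordsUdf, topWords and x are just illustrative) could look like this:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType

# Sketch: build a "top x words" UDF from a vocabulary list (x defaults to 10)
def makeTopWordsUdf(vocabulary, x=10):
    def topWords(vector):
        weights = vector.toArray().tolist()
        # Indices of the x largest weights, in descending order
        topIndices = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:x]
        return [vocabulary[i] for i in topIndices]
    return udf(topWords, ArrayType(StringType()))

getTopWord_udf = makeTopWordsUdf(countVectorizerModel.vocabulary)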
I am still a beginner with Spark, so if you know how to improve this, please let me know :)