Is it possible to write a Spark dataframe with an ArrayType column to Excel using the Crealytics spark-excel package?
I want to create an XLSX file from a Spark dataframe, and I am using Crealytics' spark-excel package (https://github.com/crealytics/spark-excel) to do so.
This is how I create the Spark dataframe:
my_row = Row([1, 2, 3, 4, 5], True, decimal.Decimal(5.66), float(10.111), 5, "some string",
             datetime(2019, 10, 23, 8, 36, 49, tzinfo=tzlocal()))
my_schema = StructType([StructField("array_type", ArrayType(IntegerType()), True),
                        StructField("boolean_type", BooleanType(), True),
                        StructField("decimal_type", DecimalType(), True),
                        StructField("double_type", DoubleType(), True),
                        StructField("int_type", IntegerType(), True),
                        StructField("string_type", StringType(), True),
                        StructField("timestamp_type", TimestampType(), True)])
df = spark.createDataFrame([my_row] , my_schema)
This is how I try to create the Excel file from the DF:
df.write.format("com.crealytics.spark.excel").option("header","true").option("timestampformat","yyyy-MM-dd hh:mm:ss").save("path_to_excel\\crealytics_output.xlsx")
And here is the error:
Py4JJavaError Traceback (most recent call last)
<ipython-input-14-83d86eb4a8f1> in <module>()
1 # df = df.drop("array_type")
----> 2 df.write.format("com.crealytics.spark.excel").option("header","true").option("timestampformat","yyyy-MM-dd hh:mm:ss").save("file:///C:\\temp_folder\\crealytics_array.xlsx")
c:\python27\lib\site-packages\pyspark\sql\readwriter.pyc in save(self, path, format, mode, partitionBy, **options)
736 self._jwrite.save()
737 else:
--> 738 self._jwrite.save(path)
739
740 @since(1.4)
c:\python27\lib\site-packages\py4j\java_gateway.pyc in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
c:\python27\lib\site-packages\pyspark\sql\utils.pyc in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
c:\python27\lib\site-packages\py4j\protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling 012.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o278.save.
: scala.MatchError: WrappedArray(1, 2, 3, 4, 5) (of class scala.collection.mutable.WrappedArray$ofRef)
at com.crealytics.spark.excel.AreaDataLocator$class.toCell(DataLocator.scala:117)
at com.crealytics.spark.excel.CellRangeAddressDataLocator.toCell(DataLocator.scala:134)
at com.crealytics.spark.excel.AreaDataLocator$$anonfun$7$$anonfun$8.apply(DataLocator.scala:107)
at com.crealytics.spark.excel.AreaDataLocator$$anonfun$7$$anonfun$8.apply(DataLocator.scala:106)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at com.crealytics.spark.excel.AreaDataLocator$$anonfun$7.apply(DataLocator.scala:106)
at com.crealytics.spark.excel.AreaDataLocator$$anonfun$7.apply(DataLocator.scala:104)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1334)
at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294)
at scala.collection.AbstractIterator.toList(Iterator.scala:1334)
at com.crealytics.spark.excel.AreaDataLocator$class.toSheet(DataLocator.scala:110)
at com.crealytics.spark.excel.CellRangeAddressDataLocator.toSheet(DataLocator.scala:134)
at com.crealytics.spark.excel.ExcelFileSaver.sheet$1(ExcelFileSaver.scala:36)
at com.crealytics.spark.excel.ExcelFileSaver.writeToWorkbook$1(ExcelFileSaver.scala:40)
at com.crealytics.spark.excel.ExcelFileSaver.save(ExcelFileSaver.scala:48)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:60)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Here is the complete code:
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import IntegerType, StructField, StructType, StringType, LongType, ArrayType, BooleanType, \
    TimestampType, DecimalType, DoubleType
from datetime import datetime
from dateutil.tz import tzlocal
import decimal

my_row = Row([1, 2, 3, 4, 5], True, decimal.Decimal(5.66), float(10.111), 5, "some string",
             datetime(2019, 10, 23, 8, 36, 49, tzinfo=tzlocal()))
my_schema = StructType([StructField("array_type", ArrayType(IntegerType()), True),
                        StructField("boolean_type", BooleanType(), True),
                        StructField("decimal_type", DecimalType(), True),
                        StructField("double_type", DoubleType(), True),
                        StructField("int_type", IntegerType(), True),
                        StructField("string_type", StringType(), True),
                        StructField("timestamp_type", TimestampType(), True)])

spark = SparkSession.builder.master("local").appName("Testing Excel").getOrCreate()
df = spark.createDataFrame([my_row], my_schema)
df.show()

df.write.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("timestampformat", "yyyy-MM-dd hh:mm:ss") \
    .save("path_to_excel\\crealytics_output.xlsx")
Converting the Spark DF to a pandas DF and using pandas' to_excel() method produces the Excel file correctly, so the data itself is fine. What I need help with is creating the Excel file with Crealytics' spark-excel package.
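For reference, the pandas fallback looks roughly like this (a sketch, assuming the frame fits in driver memory and that an Excel writer such as openpyxl or xlsxwriter is installed; the output file name is illustrative):

pdf = df.toPandas()
# Depending on the pandas version, tz-aware timestamps may need to be
# stripped first, e.g. pdf["timestamp_type"].dt.tz_localize(None).
pdf.to_excel("path_to_excel\\pandas_output.xlsx", index=False)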
Additional details:
Python version: 2.7.12
Spark version: 2.4.4
spark-excel version: spark-excel_2.11:0.13.1
Answer
Well, everything looks fine. There is an 'append' mode you could try. It should look something like this:
import org.apache.spark.sql._
val df: DataFrame = ???
df.write
.format("com.crealytics.spark.excel")
.option("header", "true")
.mode("append") // Optional, default: overwrite.
.save("path_to_excel\\crealytics_output.xlsx")
Does that work for you?
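If append mode does not help: the scala.MatchError on WrappedArray comes from DataLocator.toCell, which suggests spark-excel 0.13.1 simply has no cell converter for ArrayType columns. A common workaround is to serialize the array column to a string before writing. A minimal PySpark sketch, where the comma separator and the cast to array<string> are my assumptions about an acceptable flat representation:

from pyspark.sql.functions import col, concat_ws

# Serialize the array column to a plain string so every cell maps to a
# type spark-excel knows how to write. The cast turns the integer
# elements into strings so concat_ws accepts them; the separator is an
# arbitrary choice.
df_flat = df.withColumn("array_type",
                        concat_ws(",", col("array_type").cast("array<string>")))

df_flat.write.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("timestampformat", "yyyy-MM-dd hh:mm:ss") \
    .save("path_to_excel\\crealytics_output.xlsx")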