Different behavior of Spark Dataset for java.util.UUID
[Posted]: 2016-08-26 17:15:18
[Question]: I am using Spark 2.0.0 and creating a Dataset through SparkSession. When I pass java.util.UUID directly as the bean class to the createDataFrame method, it works fine. But when java.util.UUID is a field of a JavaBean and I create the Dataset from that JavaBean, I get a scala.MatchError. See the code and console log below. Can anyone explain what is happening here, and how to create a Dataset from a JavaBean class that has a UUID field? Thanks.
UUIDTest.java
import java.util.Arrays;
import java.util.List;
import java.util.UUID;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class UUIDTest {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
            .builder()
            .appName("UUIDTest")
            .config("spark.sql.warehouse.dir", "/file:C:/temp")
            .master("local[2]")
            .getOrCreate();
        System.out.println("====> Create Dataset using UUID");
        // Working: UUID itself is passed as the bean class
        List<UUID> uuids = Arrays.asList(UUID.randomUUID(), UUID.randomUUID());
        Dataset<Row> uuidSet = spark.createDataFrame(uuids, UUID.class);
        uuidSet.show();
        System.out.println("====> Create Dataset using UserUUID");
        // Not working: UUID is a field of the UserUUID bean
        List<UserUUID> userUuids = Arrays.asList(new UserUUID(UUID.randomUUID()), new UserUUID(UUID.randomUUID()));
        Dataset<Row> userUuidSet = spark.createDataFrame(userUuids, UserUUID.class); // Exception at this line
        userUuidSet.show();
        spark.stop();
    }
}
UserUUID.java
import java.io.Serializable;
import java.util.UUID;

public class UserUUID implements Serializable {
    private UUID uuid;
    public UserUUID() {}
    public UserUUID(UUID uuid) {
        this.uuid = uuid;
    }
    public UUID getUuid() {
        return uuid;
    }
    public void setUuid(UUID uuid) {
        this.uuid = uuid;
    }
}
Console output
16/08/26 22:49:23 INFO SharedState: Warehouse path is '/file:C:/temp'.
====> Create Dataset using UUID
16/08/26 22:49:26 INFO CodeGenerator: Code generated in 248.230818 ms
16/08/26 22:49:26 INFO CodeGenerator: Code generated in 10.550477 ms
+--------------------+-------------------+
|leastSignificantBits|mostSignificantBits|
+--------------------+-------------------+
|-6786538026241948655|5045373365275148508|
|-9161219066266259673|6040751881536491488|
+--------------------+-------------------+
====> Create Dataset using UserUUID
Exception in thread "main" scala.MatchError: 4fa3941c-f312-4031-a61b-01f2acef751b (of class java.util.UUID)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403)
at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106)
at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.toStream(Iterator.scala:1322)
at scala.collection.AbstractIterator.toStream(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toSeq(TraversableOnce.scala:298)
at scala.collection.AbstractIterator.toSeq(Iterator.scala:1336)
at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:373)
at com.UUIDTest.main(UUIDTest.java:30)
16/08/26 22:49:26 INFO SparkContext: Invoking stop() from shutdown hook
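The column names in the first table of the log (leastSignificantBits, mostSignificantBits) match the two getters that java.util.UUID exposes, which suggests that createDataFrame treats UUID as an ordinary JavaBean. The stack trace is consistent with the nested field then being expected as a struct while the converter receives a raw UUID object, although that reading is an inference from the trace rather than something the thread confirms. The sketch below uses only the JDK's java.beans.Introspector to list the readable bean properties of UUID; the class name UuidBeanProperties and the helper readableProperties are hypothetical, and the code does not reproduce Spark's own inference logic.

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.UUID;

// Hypothetical helper class, used for illustration only.
public class UuidBeanProperties {

    // Returns the names of all readable JavaBean properties of a class, sorted by name.
    // Object.class is the stop class, so the "class" property from getClass() is left out.
    static List<String> readableProperties(Class<?> type) throws IntrospectionException {
        BeanInfo info = Introspector.getBeanInfo(type, Object.class);
        List<String> names = new ArrayList<>();
        for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
            if (pd.getReadMethod() != null) {
                names.add(pd.getName());
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws IntrospectionException {
        // Prints the bean properties that introspection finds on java.util.UUID.
        System.out.println(readableProperties(UUID.class));
    }
}
```

The two names returned for UUID.class are the same as the column headers shown by uuidSet.show() in the log above.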
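[Question comments]:
I ran into exactly the same problem. Did you ever find a solution?
[Answer 1]: When I faced this problem, and after many attempts to get it working, the only solution I found was to use list<text> instead of list<uuid>, and to do the mapping at the Java level whenever I need a UUID, using UUID.fromString(uuidStr).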
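Applied to the single-field bean from the question, the same idea might look like the sketch below: the bean keeps the UUID as a String and converts back with UUID.fromString only on the Java side. The class name UserUuidText and the method toUuid are hypothetical and do not come from the thread; toUuid is deliberately not named like a getter so that bean introspection does not pick it up as an extra property.

```java
import java.io.Serializable;
import java.util.UUID;

// Hypothetical variant of the UserUUID bean that stores the UUID as text.
public class UserUuidText implements Serializable {
    private String uuid;

    public UserUuidText() {}

    // Convert to the canonical string form when the bean is built.
    public UserUuidText(UUID uuid) {
        this.uuid = uuid.toString();
    }

    // The only bean property is a plain String.
    public String getUuid() {
        return uuid;
    }

    public void setUuid(String uuid) {
        this.uuid = uuid;
    }

    // Java-level mapping back to UUID; not a getter, so it is not a bean property.
    public UUID toUuid() {
        return UUID.fromString(uuid);
    }

    public static void main(String[] args) {
        UUID original = UUID.randomUUID();
        UserUuidText bean = new UserUuidText(original);
        System.out.println(bean.getUuid());
        System.out.println(bean.toUuid().equals(original));
    }
}
```

Assuming the rest of UUIDTest stays the same, a list of these beans could be passed to spark.createDataFrame(beans, UserUuidText.class), which should then see only a string-typed uuid property; this was not run against Spark 2.0.0 here.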