How to convert an array-type Dataset column to string type in Apache Spark (Java)
My Dataset has an array-type column that I need to convert to string type. I have tried it the traditional way, but I feel this can be done better. Can you guide me? Input Dataset 1:
+-------------------+-----------+-------------------------------------------------------------------------------------------------+
|ManufacturerSource |upcSource  |productDescriptionSource                                                                         |
+-------------------+-----------+-------------------------------------------------------------------------------------------------+
|3M                 |51115665883|[c, gdg, whl, t27, 5, x, 1, 4, x, 7, 8, grindig, flap, wheels, 36, grit, 12, 250, rpm]           |
|3M                 |51115665937|[c, gdg, whl, t27, q, c, 6, x, 1, 4, x, 5, 8, 11, grinding, flap, wheels, 36, grit, 10, 200, rpm]|
|3M                 |0          |[3mite, rb, cloth, 3, x, 2, wd]                                                                  |
|3M                 |0          |[trizact, disc, cloth, 237aaa16x5, hole]                                                         |
+-------------------+-----------+-------------------------------------------------------------------------------------------------+
Expected output Dataset:
+-------------------+-----------+--------------------------------------------------------------------------+
|ManufacturerSource |upcSource  |productDescriptionSource                                                  |
+-------------------+-----------+--------------------------------------------------------------------------+
|3M                 |51115665883|c gdg whl t27 5 x 1 4 x 7 8 grinding flap wheels 36 grit 12 250 rpm       |
|3M                 |51115665937|c gdg whl t27 q c 6 x 1 4 x 5 8 11 grinding flap wheels 36 grit 10 200 rpm|
|3M                 |0          |3mite rb cloth 3 x 2 wd                                                   |
|3M                 |0          |trizact disc cloth 237aaa16x5 hole                                        |
+-------------------+-----------+--------------------------------------------------------------------------+
Traditional Approach 1:
Dataset<Row> afterStopwordsRemoved = stopwordsRemoved.select("productDescriptionSource");
afterStopwordsRemoved.show();
// Collect everything to the driver and join each array value into one string
List<Row> individualRows = afterStopwordsRemoved.collectAsList();
System.out.println("After flatmap");
List<String> temp;
for (Row individualRow : individualRows) {
    temp = individualRow.getList(0);
    System.out.println(String.join(" ", temp));
}
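The driver-side step above boils down to plain String.join over the List that Row.getList(0) returns for an array column. A minimal, Spark-free illustration of that join (the class name is made up for the demo):

```java
import java.util.Arrays;
import java.util.List;

public class JoinDemo {
    public static void main(String[] args) {
        // Row.getList(0) on an array<string> column yields a List<String>;
        // String.join stitches it back into one space-separated string.
        List<String> tokens = Arrays.asList("trizact", "disc", "cloth", "237aaa16x5", "hole");
        String joined = String.join(" ", tokens);
        System.out.println(joined); // trizact disc cloth 237aaa16x5 hole
    }
}
```

Note that collectAsList pulls the whole Dataset onto the driver, so this only works for small results; the transformation itself never runs on the cluster.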
Approach 2 (does not produce a result)
Exception: Failed to execute user defined function ($anonfun$27: (array) => string)
// Fails at runtime: Spark passes an array<string> column to a Java UDF as a
// scala.collection.Seq (WrappedArray), not as a Java String[].
UDF1<String[], String> untoken = new UDF1<String[], String>() {
    @Override
    public String call(String[] token) throws Exception {
        // return types.replaceAll("[^a-zA-Z0-9\s+]", "");
        return Arrays.toString(token);
    }
};
sqlContext.udf().register("unTokenize", untoken, DataTypes.StringType);
source.createOrReplaceTempView("DataSetOfTokenize");
Dataset<Row> newDF = sqlContext.sql("select *, unTokenize(productDescriptionSource) FROM DataSetOfTokenize");
newDF.show(4000, false);
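Even if the type mismatch were fixed (e.g. by declaring the UDF over scala.collection.Seq<String> instead of String[]), Arrays.toString would still not produce the expected output: it renders a bracketed, comma-separated form, not the space-separated string shown in the expected Dataset. A small Spark-free sketch of the difference:

```java
import java.util.Arrays;

public class ArraysToStringDemo {
    public static void main(String[] args) {
        String[] token = {"3mite", "rb", "cloth", "3", "x", "2", "wd"};
        // Arrays.toString wraps the elements in brackets and separates with ", "
        System.out.println(Arrays.toString(token)); // [3mite, rb, cloth, 3, x, 2, wd]
        // String.join gives the space-separated form the question expects
        System.out.println(String.join(" ", token)); // 3mite rb cloth 3 x 2 wd
    }
}
```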
Answer
I would use concat_ws:
sqlContext.sql("select *, concat_ws(' ', productDescriptionSource) FROM DataSetOfTokenize");
Or:
import static org.apache.spark.sql.functions.*;
df.withColumn("foo", concat_ws(" ", col("productDescriptionSource")));
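For intuition, Spark SQL's concat_ws joins the array elements with the given separator and, to my understanding, silently skips null elements rather than failing. A plain-Java sketch of that behavior (the helper class and method names are made up for the demo):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

public class ConcatWsDemo {
    // Plain-Java approximation of concat_ws(sep, array):
    // join elements with the separator, dropping nulls.
    public static String concatWs(String sep, List<String> tokens) {
        return tokens.stream()
                .filter(Objects::nonNull)
                .collect(Collectors.joining(sep));
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("trizact", "disc", null, "cloth", "237aaa16x5", "hole");
        System.out.println(concatWs(" ", tokens)); // trizact disc cloth 237aaa16x5 hole
    }
}
```

Unlike the collectAsList approach, concat_ws runs on the executors as a native Spark SQL expression, so it scales to the full Dataset without pulling rows to the driver.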