Java-Spark broadcast variable serialization problem

Posted Ssc_Zcx


1. Symptoms

        1) Code

        


    SparkSession sparkSession = SparkSession.builder().config(sparkConf).getOrCreate();
    JavaSparkContext javaSparkContext = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());

    Dataset<Row> labelDimensionTable = sparkSession.read().parquet(labelDimPath);
    Map<String, Long> labelNameToId = getNameToId(labelDimensionTable);
    Broadcast<Map<String, Long>> labelNameIdBroadcast = javaSparkContext.broadcast(labelNameToId);

    Map<String, Long> getNameToId(Dataset<Row> labelDimTable) {
        return labelDimTable.javaRDD().mapToPair(
                new PairFunction<Row, String, Long>() {
                    @Override
                    public Tuple2<String, Long> call(Row curRow) throws Exception {
                        Long labelId = curRow.getAs("label_id");
                        String labelTitle = curRow.getAs("label_title");

                        return Tuple2.apply(labelTitle, labelId);
                    }
                }
        ).collectAsMap();
    }

        2) Error message


20/09/09 18:23:00 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 5.0 (TID 4008, node-hadoop67.com, executor 3, partition 0, RACK_LOCAL, 8608 bytes)
20/09/09 18:23:00 INFO storage.BlockManagerInfo: Added broadcast_9_piece0 in memory on node-hadoop67.com:23191 (size: 41.1 KB, free: 2.5 GB)
20/09/09 18:23:01 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on node-hadoop67.com:23191 (size: 33.5 KB, free: 2.5 GB)
20/09/09 18:23:02 INFO storage.BlockManagerInfo: Added broadcast_5_piece1 in memory on node-hadoop67.com:23191 (size: 698.1 KB, free: 2.5 GB)
20/09/09 18:23:02 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on node-hadoop67.com:23191 (size: 4.0 MB, free: 2.5 GB)
20/09/09 18:23:02 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 5.0 (TID 4008, node-hadoop67.com, executor 3): java.io.IOException: java.lang.UnsupportedOperationException
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1367)
	at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
	at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
	at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
	at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
	at com.kk.search.user_profile.task.user_profile.UserLabelProfile$1.call(UserLabelProfile.java:157)
	at org.apache.spark.sql.Dataset$$anonfun$44.apply(Dataset.scala:2605)
	at org.apache.spark.sql.Dataset$$anonfun$44.apply(Dataset.scala:2605)
	at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:188)
	at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:185)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:381)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.UnsupportedOperationException
	at java.util.AbstractMap.put(AbstractMap.java:209)
	at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:162)
	at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:39)
	at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
	at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:278)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$8.apply(TorrentBroadcast.scala:308)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1394)
	at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:309)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:235)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1360)
	... 29 more

20/09/09 18:23:02 INFO scheduler.TaskSetManager: Starting task 0.1 in stage 5.0 (TID 4009, node-hadoop64.com, executor 7, partition 0, RACK_LOCAL, 8608 bytes)


2. Cause

        This is a serialization issue. When using the Java API, if the variable being broadcast is produced by `collectAsMap()`, the declared type is `Map`, but the actual object is a Scala map wrapper. Kryo does not know the real concrete type, so it falls back to `AbstractMap` when deserializing; `AbstractMap.put` throws `UnsupportedOperationException`, as the `Caused by` section of the stack trace above shows.
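The failure mode can be reproduced without Spark. Kryo's `MapSerializer` rebuilds a map by instantiating it and calling `put` on each entry, and a map that cannot be mutated throws `UnsupportedOperationException`. A minimal plain-Java sketch of that behavior (using `Collections.unmodifiableMap` as a stand-in for the wrapper returned by `collectAsMap()`):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class BroadcastMapDemo {
    public static void main(String[] args) {
        Map<String, Long> base = new HashMap<>();
        base.put("label_a", 1L);

        // Stand-in for the map returned by collectAsMap(): it cannot be mutated,
        // which is exactly what Kryo's MapSerializer tries to do on deserialization.
        Map<String, Long> collected = Collections.unmodifiableMap(base);
        try {
            collected.put("label_b", 2L);
        } catch (UnsupportedOperationException e) {
            System.out.println("put failed: " + e.getClass().getSimpleName());
        }

        // Copying into a concrete HashMap yields a map Kryo can rebuild.
        Map<String, Long> safe = new HashMap<>(collected);
        safe.put("label_b", 2L);
        System.out.println(safe.size()); // 2
    }
}
```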

3. Solution

        Create a new map, copy the result of `collectAsMap()` into it, and broadcast that new map instead.

The original code:

    Map<String, Long> getNameToId(Dataset<Row> labelDimTable) {
        return labelDimTable.javaRDD().mapToPair(
                new PairFunction<Row, String, Long>() {
                    @Override
                    public Tuple2<String, Long> call(Row curRow) throws Exception {
                        Long labelId = curRow.getAs("label_id");
                        String labelTitle = curRow.getAs("label_title");

                        return Tuple2.apply(labelTitle, labelId);
                    }
                }
        ).collectAsMap();
    }

Change it to:


    Map<String, Long> getNameToId(Dataset<Row> labelDimTable) {

        Map<String, Long> res = new HashMap<>();
        Map<String, Long> apiMap = labelDimTable.javaRDD().mapToPair(
                new PairFunction<Row, String, Long>() {
                    @Override
                    public Tuple2<String, Long> call(Row curRow) throws Exception {
                        Long labelId = curRow.getAs("label_id");
                        String labelTitle = curRow.getAs("label_title");

                        return Tuple2.apply(labelTitle, labelId);
                    }
                }
        ).collectAsMap();
        res.putAll(apiMap);
        return res;
    }

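Equivalently, the copy can be done in one step with the `HashMap` copy constructor, which performs the same `putAll`. A plain-Java sketch (using `Collections.singletonMap` as a stand-in for the unmodifiable map returned by `collectAsMap()`):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class CopyMapDemo {
    public static void main(String[] args) {
        // apiMap stands in for the result of collectAsMap(): it is unmodifiable
        Map<String, Long> apiMap = Collections.singletonMap("label_a", 1L);

        // new HashMap<>(apiMap) is equivalent to new HashMap<>() followed by putAll(apiMap);
        // the broadcast value is then a plain HashMap that Kryo can deserialize
        Map<String, Long> res = new HashMap<>(apiMap);
        res.put("label_b", 2L); // mutable, unlike apiMap
        System.out.println(res.size()); // 2
    }
}
```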


Reference:

https://stackoverflow.com/questions/43023961/spark-kryo-serializers-and-broadcastmapobject-iterablegowalladatalocation

How to write common operators in java-spark

    Writing Spark programs is usually more convenient in Scala, since Spark itself is written in Scala. However, there are many Java developers, especially for data integration and serving work, so it is worth knowing how to use Spark from Java.

   1. map

     map is the go-to operator for processing and transforming data.

     Before using map, first define a transformation function, in this format:

   

Function<String, LabeledPoint> transForm = new Function<String, LabeledPoint>() { // String is the input row type; LabeledPoint is the output type after transformation
            @Override
            public LabeledPoint call(String row) throws Exception { // override the call method
                String[] rowArr = row.split(",");
                int rowSize = rowArr.length;

                double[] doubleArr = new double[rowSize - 1];

                // apart from the label in the first field, parse the rest as doubles into the array
                for (int i = 1; i < rowSize; i++) {
                    String each = rowArr[i];
                    doubleArr[i - 1] = Double.parseDouble(each); // i - 1: the feature array is one element shorter than the row
                }

                // build a vector from the parsed values
                Vector feature = Vectors.dense(doubleArr);
                double label = Double.parseDouble(rowArr[0]);
                // build the LabeledPoint format used for classifier training
                LabeledPoint point = new LabeledPoint(label, feature);
                return point;
            }
        };
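The row-parsing logic inside `call` can be exercised without Spark. A plain-Java sketch (note the feature array is one element shorter than the row, so row index `i` must map to feature index `i - 1`):

```java
import java.util.Arrays;

public class RowParseDemo {
    // Mirrors the body of call(String row): first field is the label, the rest are features.
    static double[] parseFeatures(String row) {
        String[] rowArr = row.split(",");
        double[] doubleArr = new double[rowArr.length - 1];
        for (int i = 1; i < rowArr.length; i++) {
            // row index i maps to feature index i - 1
            doubleArr[i - 1] = Double.parseDouble(rowArr[i]);
        }
        return doubleArr;
    }

    public static void main(String[] args) {
        String row = "1,0.5,2.0,3.5";
        double label = Double.parseDouble(row.split(",")[0]);
        System.out.println(label);                               // 1.0
        System.out.println(Arrays.toString(parseFeatures(row))); // [0.5, 2.0, 3.5]
    }
}
```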

  Note in particular:

      1. The input of the call method should be the row type before transformation, and the return value the row type after transformation.

      2. If the transformation calls a user-defined class, that class must implement Serializable, for example:

public class TreeEnsemble implements Serializable {
}

  3. If the transformation function references outside objects, for example a parameter or a preprocessing model (scaler, normalizer, etc.), that object must be declared final.

      Then call the transformation function where appropriate:

      

JavaRDD<LabeledPoint> rdd = oriData.toJavaRDD().map(transForm);

  This approach requires converting a plain RDD to a JavaRDD first; that conversion is not expensive, so there is no need to worry about it.

   2. filter

    filter is also very common for dropping rows with nulls, zeros, and so on; it covers what the WHERE clause does in SQL.

    Again, first define a function: given a data row it returns a Boolean, and only the rows for which it returns true are kept.

Function<String, Boolean> boolFilter = new Function<String, Boolean>() { // String is the input row type; Boolean is the output, deciding whether the row is kept
            @Override
            public Boolean call(String row) throws Exception { // override the call method
                boolean flag = row != null;
                return flag;
            }
        };

  In practice, usually only the type of row, i.e. the input row type, needs changing; unlike the transformation function above, the return type of this call method is fixed to Boolean.

    Then call it like this:

    

JavaRDD<LabeledPoint> rdd = oriData.toJavaRDD().filter(boolFilter);
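The keep-if-true semantics are the same as a plain Java stream filter; this illustrative analogue (not the Spark API) shows the null-dropping behavior on a local list:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

public class FilterDemo {
    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a", null, "b");
        // keep only the rows for which the predicate returns true,
        // just like JavaRDD.filter keeps rows where call(...) returns true
        List<String> kept = rows.stream()
                .filter(Objects::nonNull)
                .collect(Collectors.toList());
        System.out.println(kept); // [a, b]
    }
}
```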

  

    3. mapToPair

    This method is similar to map in that it also transforms the data, but the function takes one row and outputs a tuple. It is most commonly used for cross-validation, or for computing error rate, recall, AUC, and so on.

    Again, define the transformation function first:

PairFunction<LabeledPoint, Object, Object> transformer = new PairFunction<LabeledPoint, Object, Object>() { // LabeledPoint is the input type; leave the two Objects unchanged
            @Override
            public Tuple2<Object, Object> call(LabeledPoint row) throws Exception { // override call; usually only the input type changes, leave the output alone
                double prediction = thismodel.predict(row.features());
                double label = row.label();
                return new Tuple2<>(prediction, label);
            }
        };

  The same requirements as before apply to any classes and objects used here: the classes must implement Serializable, and captured objects must be declared final.

     The corresponding call looks like this:

JavaPairRDD<Object, Object> predictionsAndLabels = oriData.mapToPair(transformer);
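The pairing logic can be mimicked in plain Java with `Map.Entry` as an illustrative stand-in for Scala's `Tuple2`; `fakePredict` below is a hypothetical model substitute for `thismodel.predict(...)`:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapToPairDemo {
    // stand-in for thismodel.predict(...): a hypothetical model that echoes the label
    static double fakePredict(double label) {
        return label;
    }

    public static void main(String[] args) {
        // each "row" (a label here) is mapped to a (prediction, label) pair,
        // just as mapToPair emits one Tuple2 per input row
        List<Double> labels = Arrays.asList(1.0, 0.0, 1.0);
        List<Map.Entry<Double, Double>> predictionsAndLabels = labels.stream()
                .map(label -> new SimpleEntry<>(fakePredict(label), label))
                .collect(Collectors.toList());
        System.out.println(predictionsAndLabels); // [1.0=1.0, 0.0=0.0, 1.0=1.0]
    }
}
```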

  Using this predictionsAndLabels to compute accuracy, recall, precision, and AUC will be covered in upcoming posts.

      If you have additions or objections, or related questions, send me an email or reply directly. Email: [email protected]

     

    

 

 

 
