Spark bulkload errors and how to fix them
Posted by niutao
The task: parse data on HDFS and batch-write it into HBase via HFiles (multiple columns per row need to be written).
The key API call for writing the data:
rdd.saveAsNewAPIHadoopFile(
stagingFolder,
classOf[ImmutableBytesWritable],
classOf[KeyValue],
classOf[HFileOutputFormat2],
job.getConfiguration)
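For reference, the classes in that call come from the HBase and Hadoop MapReduce APIs. A minimal import block, assuming an HBase 1.x-style package layout (in newer 2.x releases LoadIncrementalHFiles lives under org.apache.hadoop.hbase.tool instead):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{KeyValue, TableName}
import org.apache.hadoop.hbase.client.{Connection, HTable, Table}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.rdd.RDD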
Tricky points:
1):
If the bulk load fails with the error:
Does it contain files in subdirectories that correspond to column family names
the usual causes are:
A: a bug in the code
B: a problem in the source data
C: the class set via setMapOutputKeyClass does not match the one passed to saveAsNewAPIHadoopFile (see the sketch below)
(In my case it was the source data.)
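For cause C, "matching" means the key/value classes declared on the MapReduce job and the classes passed to saveAsNewAPIHadoopFile must be the same pair. A minimal sketch of what that looks like; job, config, stagingFolder and sortedRdd are placeholder names, not from the original code:

// The key/value classes declared on the job must be the same classes that
// saveAsNewAPIHadoopFile writes out, otherwise HFileOutputFormat2 may not
// produce usable files under the staging directory.
val job = Job.getInstance(config)
job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])   // key class
job.setMapOutputValueClass(classOf[KeyValue])               // value class

sortedRdd.saveAsNewAPIHadoopFile(
  stagingFolder,
  classOf[ImmutableBytesWritable],   // must match setMapOutputKeyClass
  classOf[KeyValue],                 // must match setMapOutputValueClass
  classOf[HFileOutputFormat2],
  job.getConfiguration)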
2):
With ordinary put operations the server sorts the cells for you, so puts never run into this error:
Added a key not lexically larger than previous
If that error does show up while writing HFiles, the usual assumption is that the rowkeys are not sorted. I naively went off and checked, and the rowkeys were in fact sorted.
When Spark writes HFiles, cells are ordered by rowkey + column family + column qualifier, so the data has to be fully ordered on that whole key when it is written, not just on the rowkey.
(And that is still not the end of the story.)
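One way to get that full ordering is to sort on the whole (rowkey, family, qualifier) triple instead of the rowkey alone. This is only a sketch of the idea, not the approach used in the full code later in this post (which sorts the qualifiers inside each row and the rows by rowkey); kvRdd stands in for an RDD[(ImmutableBytesWritable, KeyValue)]:

import org.apache.hadoop.hbase.CellUtil

// Sort by rowkey, then column family, then qualifier, so cells reach
// HFileOutputFormat2 in the order the HFile format requires.
// String comparison matches HBase's byte order only for ASCII keys,
// which is assumed here.
val fullySorted: RDD[(ImmutableBytesWritable, KeyValue)] = kvRdd.sortBy({ case (_, kv) =>
  (Bytes.toString(CellUtil.cloneRow(kv)),
   Bytes.toString(CellUtil.cloneFamily(kv)),
   Bytes.toString(CellUtil.cloneQualifier(kv)))
}, ascending = true)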
3):
Since multiple columns need to be written, the two cleanest options are: use reflection to obtain the column names and values dynamically, or go through a DataFrame (df.columns).
Reflection approach:
val listData: RDD[(ImmutableBytesWritable, ListBuffer[KeyValue])] = rdd.map { line =>
  val rowkey = line.vintime
  val clazz = Class.forName(XXXXXXXXXXXXXXXX)
  val fields = clazz.getDeclaredFields
  var list = new ListBuffer[String]()
  var kvlist = new ListBuffer[KeyValue]()
  if (fields != null && fields.size > 0) {
    for (field <- fields) {
      field.setAccessible(true)
      val column = field.getName
      list.append(column)
    }
  }
  val newList = list.sortWith(_ < _)
  val ik = new ImmutableBytesWritable(Bytes.toBytes(rowkey))
  for (column <- newList) {
    val declaredField: Field = line.getClass.getDeclaredField(column)
    declaredField.setAccessible(true)
    val value = declaredField.get(line).toString
    val kv: KeyValue = new KeyValue(
      Bytes.toBytes(rowkey),
      Bytes.toBytes(columnFamily),
      Bytes.toBytes(column),
      Bytes.toBytes(value))
    kvlist.append(kv)
  }
  (ik, kvlist)
}
DataFrame approach:
val tmpData: RDD[(ImmutableBytesWritable, util.LinkedList[KeyValue])] = df.rdd.map(line => {
  val rowkey = line.getAs[String]("vintime")
  val ik = new ImmutableBytesWritable(Bytes.toBytes(rowkey))
  var linkedList = new util.LinkedList[KeyValue]()
  for (column <- columns) {
    val kv: KeyValue = new KeyValue(
      Bytes.toBytes(rowkey),
      Bytes.toBytes(columnFamily),
      Bytes.toBytes(column),
      Bytes.toBytes(line.getAs[String](column)))
    linkedList.add(kv)
  }
  (ik, linkedList)
})
val result: RDD[(ImmutableBytesWritable, KeyValue)] = tmpData.flatMapValues(s => {
  val values: Iterator[KeyValue] = JavaConverters.asScalaIteratorConverter(s.iterator()).asScala
  values
}).sortBy(x => x._1, true)
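For the DataFrame version, the columns array itself should also be sorted, since the HFile ordering applies to the qualifiers within each row and not only to the rowkeys. A possible way to build it from df.columns, assuming (as in the snippet above) that "vintime" is used as the rowkey rather than stored as a cell:

// every column except the rowkey column, sorted so qualifiers are emitted in order
val columns: Array[String] = df.columns
  .filterNot(_ == "vintime")
  .sorted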
Look closely and you will see that both versions already sort the data, yet even with the ordering from step 2) in place the job still fails with:
Added a key not lexically larger than previous
Recall the requirement for writing HFiles: rowkey + column family + column qualifier must all be ordered. If the data contains duplicate keys, it is not strictly ordered either!
So deduplicate the data first:
val key: RDD[(String, TransferTime)] = data.reduceByKey((x, y) => y)
val unitData: RDD[TransferTime] = key.map(line => line._2)
Sure enough, that got rid of the Added a key not lexically larger than previous exception.
But then another exception shows up:
Kryo serialization failed: Buffer overflow
This happens because Kryo-serializing some of the classes produces more data than the default buffer allows; adjusting the buffer settings fixes it:
sparkConf.set("spark.kryoserializer.buffer.max" , "256m")
sparkConf.set("spark.kryoserializer.buffer" , "64m")
Full code:
/**
  * Created by angel
  */
object WriteTransferTime extends WriteToHbase {
  /**
    * @param data         the records to insert
    * @param tableName    target table name
    * @param columnFamily target column family
    **/
  override def bulkLoadData(data: RDD[Any], tableName: String, columnFamily: String): Unit = {
    val bean: RDD[TransferTime] = data.map(line => line.asInstanceOf[TransferTime])
    // key by rowkey, deduplicate, then sort by rowkey
    val map: RDD[(String, TransferTime)] = bean.map(line => (line.vintime, line))
    val key: RDD[(String, TransferTime)] = map.reduceByKey((x, y) => y)
    val map1: RDD[TransferTime] = key.map(line => line._2)
    val by1: RDD[TransferTime] = map1.sortBy(f => f.vintime)
    // build the KeyValues for each row, with the column names sorted
    val listData: RDD[(ImmutableBytesWritable, ListBuffer[KeyValue])] = by1.map { line =>
      val rowkey = line.vintime
      val clazz = Class.forName("com.dongfeng.code.Bean.message.TransferTime")
      val fields = clazz.getDeclaredFields
      var list = new ListBuffer[String]()
      var kvlist = new ListBuffer[KeyValue]()
      if (fields != null && fields.size > 0) {
        for (field <- fields) {
          field.setAccessible(true)
          val column = field.getName
          list.append(column)
        }
      }
      val newList = list.sortWith(_ < _)
      val ik = new ImmutableBytesWritable(Bytes.toBytes(rowkey))
      for (column <- newList) {
        val declaredField: Field = line.getClass.getDeclaredField(column)
        declaredField.setAccessible(true)
        val value = declaredField.get(line).toString
        val kv: KeyValue = new KeyValue(
          Bytes.toBytes(rowkey),
          Bytes.toBytes(columnFamily),
          Bytes.toBytes(column),
          Bytes.toBytes(value))
        kvlist.append(kv)
      }
      (ik, kvlist)
    }
    val result: RDD[(ImmutableBytesWritable, KeyValue)] = listData.flatMapValues(s => {
      val values: Iterator[KeyValue] = s.iterator
      values
    })
    val resultDD: RDD[(ImmutableBytesWritable, KeyValue)] = result.sortBy(x => x._1, true)
    // note: hfile_load sorts the RDD again before writing
    WriteToHbaseDB.hfile_load(result, TableName.valueOf(tableName), columnFamily)
  }
}
def hfile_load(rdd: RDD[(ImmutableBytesWritable, KeyValue)], tableName: TableName, columnFamily: String): Unit = {
  // handle on the target table
  var table: Table = null
  try {
    val startTime = System.currentTimeMillis()
    println(s"start time: -------->${startTime}")
    // temporary HDFS directory for the generated HFiles
    val stagingFolder = "hdfs://cdh1:9000/hfile/" + tableName + new Date().getTime
    table = connection.getTable(tableName)
    // create the table if it does not exist yet
    if (!admin.tableExists(tableName)) {
      createTable(tableName, columnFamily)
    }
    // write the HFiles
    val job = Job.getInstance(config)
    job.setJobName("DumpFile")
    job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
    job.setMapOutputValueClass(classOf[KeyValue])
    rdd.sortBy(x => x._1, true).saveAsNewAPIHadoopFile(
      stagingFolder,
      classOf[ImmutableBytesWritable],
      classOf[KeyValue],
      classOf[HFileOutputFormat2],
      job.getConfiguration)
    // bulk-load the generated HFiles into the table
    val load = new LoadIncrementalHFiles(config)
    val regionLocator = connection.getRegionLocator(tableName)
    HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator)
    load.doBulkLoad(new Path(stagingFolder), table.asInstanceOf[HTable])
    // load.doBulkLoad(new Path(stagingFolder), connection.getAdmin, table, regionLocator)
    val endTime = System.currentTimeMillis()
    println(s"end time: -------->${endTime}")
    println(s"elapsed: ----------------->${endTime - startTime}ms")
  } catch {
    case e: IOException => e.printStackTrace()
  } finally {
    if (table != null) {
      try {
        // close the HTable
        table.close()
      } catch {
        case e: IOException => e.printStackTrace()
      }
    }
    if (connection != null) {
      try {
        // close the HBase connection
        connection.close()
      } catch {
        case e: IOException => e.printStackTrace()
      }
    }
  }
}
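Finally, a rough sketch of a driver that wires the pieces together; the input path, table name, column family and the parsing step are all placeholders, not from the original post:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object BulkLoadDriver {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setAppName("transfer-time-bulkload")            // placeholder name
      .set("spark.kryoserializer.buffer", "64m")       // Kryo settings as discussed above
      .set("spark.kryoserializer.buffer.max", "256m")
    val sc = new SparkContext(sparkConf)

    // placeholder input path on the same namenode the staging folder uses
    val data: RDD[Any] = sc.textFile("hdfs://cdh1:9000/input/transfer_time")
      .map { line =>
        // placeholder: parse one text line into a TransferTime bean here;
        // the real field mapping depends on the source file format
        val bean: TransferTime = ???
        bean: Any
      }

    // dedup, per-row sorting, KeyValue construction and the bulk load all
    // happen inside bulkLoadData / hfile_load
    WriteTransferTime.bulkLoadData(data, "transfer_time", "info")  // placeholder table / family
    sc.stop()
  }
}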