Integrating HBase Indexer with Solr

Posted 2023-02-28 sxhong


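Lily HBase Indexer offers a fast, simple way to search content stored in HBase. It builds an index of HBase data in Solr so that the data can be queried through Solr. Because indexing is asynchronous, it does not add to the HBase write load, and with SolrCloud the index can be distributed.

The project grew out of Lily, a platform built on years of work on HBase indexing.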

Lily HBase Indexer

References:
http://ngdata.github.io/hbase-indexer/
https://github.com/NGDATA/hbase-indexer/wiki

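How it works

HBase Indexer is built on HBase replication. When data is written to an HBase region, the change is replicated asynchronously to the HBase Indexer processes, which create Solr documents from it and push them to the SolrCloud servers.

With ZooKeeper handling coordination, both the indexer and HBase can scale horizontally.

In general, the failure of an indexer node or a Solr node does not cause data loss.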
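As a rough mental model of the asynchronous flow described above, the following toy Python sketch queues edits at write time and indexes them later. It is not HBase Indexer code: the names `ReplicationQueue`, `write` and `drain` are made up for illustration, and a plain dict stands in for Solr.

```python
from collections import deque


class ReplicationQueue:
    """Toy stand-in for the replication stream that feeds the indexer."""

    def __init__(self):
        self._edits = deque()

    def write(self, row, column, value):
        # The write returns immediately; nothing is indexed yet.
        self._edits.append((row, column, value))

    def drain(self, solr_docs):
        """Apply every pending edit to solr_docs and return the count."""
        applied = 0
        while self._edits:
            row, column, value = self._edits.popleft()
            # One document per row, keyed by the row key.
            solr_docs.setdefault(row, {"id": row})[column] = value
            applied += 1
        return applied


queue = ReplicationQueue()
docs = {}
queue.write("row1", "title_s", "hello")
pending_view = dict(docs)  # still empty: indexing has not run yet
applied = queue.drain(docs)
print(pending_view, applied, docs)
```

In this model an indexer that is down for a while only delays indexing, because edits stay queued until `drain` runs.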

Standalone installation

Requirements:

HBase 0.94.x
Solr 4.x
ZooKeeper 3.x

HBase Indexer depends on ZooKeeper, so set both its own ZooKeeper connection string and the HBase ZooKeeper quorum in conf/hbase-indexer-site.xml:

<property>
  <name>hbaseindexer.zookeeper.connectstring</name>
  <value>zookeeperhost</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zookeeperhost</value>
</property>

Enable replication in HBase (hbase-site.xml):

<configuration>
  <!-- SEP is basically replication, so enable it -->
  <property>
    <name>hbase.replication</name>
    <value>true</value>
  </property>
  <property>
    <name>replication.source.ratio</name>
    <value>1.0</value>
  </property>
  <property>
    <name>replication.source.nb.capacity</name>
    <value>1000</value>
  </property>
  <property>
    <name>replication.replicationsource.implementation</name>
    <value>com.ngdata.sep.impl.SepReplicationSource</value>
  </property>
</configuration>

Copy the jars into HBase:

cp lib/hbase-sep-* $HBASE_HOME/lib

Start Solr:

cd $SOLR_HOME/example
java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkHost=localhost:2181/solr  -jar start.jar

Start HBase Indexer:

cd $INDEXER_HOME
./bin/hbase-indexer server

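Cloudera CDH integration with Solr

The Cloudera CDH suite bundles this component, and it can be added as a service. The component versions in CDH are newer than the requirements listed above, but Cloudera ships compatibility patches and removes the need for some of the configuration and for copying the jars.

Add the two services, HBase Indexer and Solr, directly. Distributed Solr depends on the ZooKeeper service. Note also that Solr must be initialized from its configuration page before it can start.

Once installed, start HBase Indexer and Solr, then carry out the following steps:

Enable replication in the HBase configuration through the CDH management UI.
Create a SolrCloud-based core. Unlike the standalone installation, the core and its index are created on HDFS. Use solrctl, the command that ships with CDH: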

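# Generate the instance directory
solrctl instancedir --generate /opt/lib/solr/collection1
# Edit the Solr field configuration
vi /opt/lib/solr/collection1/conf/schema.xml
# Upload the instance directory (solrctl stores it in ZooKeeper)
solrctl instancedir --create collection1 /opt/lib/solr/collection1
# Create the SolrCloud core; -s sets the number of shards and -r the number of replicas
solrctl collection --create collection1 -s 3 -r 1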

Create a test table with the replication parameter set:

hbase> create 'test:content_k',  NAME => 'info', REPLICATION_SCOPE => '1'

Or modify an existing test table:

hbase> disable 'test:content_k'
hbase> alter 'test:content_k', NAME=>'d',REPLICATION_SCOPE => 1
hbase> enable 'test:content_k'

Create the indexer configuration file:

vi index_test_content_k.xml

<?xml version="1.0"?>
<indexer table="test:content_k">
  <field name="md5_s" value="d:md5"/>
  <field name="title_s" value="d:title"/>
  <field name="catid_i" value="d:catid" type="int"/>
  <field name="modelid_i" value="d:modelid" type="int"/>
  <field name="published_i" value="d:published" type="int"/>
  <field name="publishedby_i" value="d:publishedby" type="int"/>
  <field name="time_s" value="d:time"/>
  <field name="datetime_s" value="d:datetime"/>
  <field name="status_i" value="d:status" type="int"/>
</indexer>
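Before registering a configuration like the one above, it can help to list what it actually maps. The sketch below uses only the Python standard library to pull out the table name and the column behind each Solr field. The function name `field_mappings` and the shortened sample are illustrative, not part of HBase Indexer.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the configuration above, for illustration only.
INDEXER_XML = """<indexer table="test:content_k">
  <field name="title_s" value="d:title"/>
  <field name="catid_i" value="d:catid" type="int"/>
</indexer>"""


def field_mappings(xml_text):
    """Return (table, [(solr_field, family, qualifier, type), ...])."""
    root = ET.fromstring(xml_text)
    mappings = []
    for field in root.findall("field"):
        value = field.get("value", "")
        family, sep, qualifier = value.partition(":")
        if not (family and sep and qualifier):
            raise ValueError("expected family:qualifier, got %r" % value)
        # type is None when the attribute is absent from the element.
        mappings.append((field.get("name"), family, qualifier, field.get("type")))
    return root.get("table"), mappings


table, mappings = field_mappings(INDEXER_XML)
print(table)
for row in mappings:
    print(row)
```

Every column family that shows up here (only `d` in this configuration) should be one that has REPLICATION_SCOPE set on the table.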

Create the Solr indexer:

hbase-indexer add-indexer -n index_test_content_k -c index_test_content_k.xml \
-cp solr.zk=hadoop-03,hadoop-02,hadoop-05:2181/solr -cp solr.collection=collection1

List the indexers:

hbase-indexer list-indexers -z hadoop-03,hadoop-02,hadoop-05:2181

Test with a data update

Note: whether Solr reflects the update immediately depends on its autoCommit configuration.
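One way to run this test is to put a cell into the table from the HBase shell (for example a `d:title` value of `hello`) and then query the collection. The Python sketch below builds and runs such a query against Solr's standard `/select` handler. The host `hadoop-03` and port 8983 (Solr's default) are assumptions about the cluster, and the helper names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumed Solr node; adjust host and port to the actual cluster.
SOLR_BASE = "http://hadoop-03:8983/solr"


def select_url(collection, query, rows=10, base=SOLR_BASE):
    """Build a /select URL that asks Solr for a JSON response."""
    params = urllib.parse.urlencode({"q": query, "rows": rows, "wt": "json"})
    return "%s/%s/select?%s" % (base, collection, params)


def search(collection, query):
    """Run the query and return the matching documents (needs a live Solr)."""
    with urllib.request.urlopen(select_url(collection, query)) as resp:
        return json.load(resp)["response"]["docs"]


url = select_url("collection1", "title_s:hello")
print(url)
```

If `search("collection1", "title_s:hello")` returns nothing right after the put, wait for the next commit before concluding that indexing failed.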

Common operations

Indexer configuration

The simplest indexer configuration has just a table name and a single field, like this:

<indexer table="mytable">
  <field name="fieldname" value="columnfamily:qualifier" type="string"/>
</indexer>

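Global indexer attributes

table

The name of the HBase table to index. It is the only required attribute of the indexer configuration.

mapping-type

The mapping type. Because HBase is a column-oriented store, tables may be designed tall or wide, so this attribute specifies whether indexing is row-based or column-based. The default is row-based.

read-row

This attribute takes one of two values, dynamic or never. The default is dynamic.

It matters when the row-based mapping type is used: it determines whether the indexer reads the row back from HBase when updating the index for existing data.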
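The two read-row modes can be pictured with a small decision function. This Python sketch is a simplified model, not the indexer's actual logic: it assumes that dynamic means "read the row again only when the update does not already carry every indexed column", and the function name `must_reread` is made up.

```python
def must_reread(read_row, updated_columns, indexed_columns):
    """Return True if the row would have to be fetched from HBase again."""
    if read_row == "never":
        # The update itself must carry everything the index needs.
        return False
    if read_row == "dynamic":
        # Assumption: re-read only when an indexed column is missing
        # from the update that triggered indexing.
        return not set(indexed_columns) <= set(updated_columns)
    raise ValueError("read-row must be 'dynamic' or 'never'")


indexed = {"d:title", "d:md5"}
print(must_reread("dynamic", {"d:title"}, indexed))
print(must_reread("dynamic", {"d:title", "d:md5"}, indexed))
print(must_reread("never", {"d:title"}, indexed))
```

Under this model, never only makes sense when every write to a row includes all the columns that the indexer configuration refers to.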

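mapper

The custom mapper class. The default is com.ngdata.hbaseindexer.parse.DefaultResultToSolrMapper.

unique-key-formatter

The unique key format. The default, com.ngdata.hbaseindexer.uniquekey.StringUniqueKeyFormatter, treats the HBase row key as a string. You can supply your own implementation.

unique-key-field

The Solr unique key field. The default is id.

row-field

The Solr field in which to store the HBase row key. It is empty by default, which means the row key is not stored in the Solr index.

column-family-field

The Solr field in which to store the HBase column family. It is empty by default, so the column family is not stored.

table-name-field

The Solr field in which to store the HBase table name. It is empty by default, so the table name is not stored.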

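Indexer element definitions

An indexer configuration supports three kinds of elements. The two described here are field and param.

field

Defines a single field to index in Solr. It has the following attributes:

name: the name of the Solr field.
value: the HBase cell or cells that supply the value. It can be written in three ways:

mycolumnfamily:myqualifier
mycolumnfamily:my*
mycolumnfamily:*

source: which part of the cell is indexed. The two options are value and qualifier, meaning the cell value or the column qualifier.

type: the data type of the HBase data being indexed. HBase stores every value as a byte array, while Solr generally indexes text, so the byte array has to be converted to its real data type. The attribute accepts any type supported by the HBase Bytes class: int, long, string, boolean, float, double, short, bigdecimal. You can also supply a custom type.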
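To see why the type attribute matters, the Python sketch below decodes raw cell bytes the way the HBase `Bytes` class lays them out: big-endian fixed-width numbers, UTF-8 strings, and a single byte for booleans. It is an illustration rather than the indexer's own code. The helper name `decode` is made up, and bigdecimal is left out for brevity.

```python
import struct

# Big-endian layouts matching what org.apache.hadoop.hbase.util.Bytes writes.
_FORMATS = {"short": ">h", "int": ">i", "long": ">q", "float": ">f", "double": ">d"}


def decode(raw, type_name="string"):
    """Turn an HBase cell value (bytes) into a Python value."""
    if type_name == "string":
        return raw.decode("utf-8")
    if type_name == "boolean":
        return raw[0] != 0  # any non-zero byte is read as true
    fmt = _FORMATS[type_name]
    if len(raw) != struct.calcsize(fmt):
        raise ValueError("%d bytes cannot be a %s" % (len(raw), type_name))
    return struct.unpack(fmt, raw)[0]


as_int = decode(b"\x00\x00\x00\x2a", "int")
as_text = decode(b"42", "string")
print(as_int, as_text)
```

The same number can be stored in either form. A value written as text, which is what a plain `put` from the HBase shell does, is the two bytes `42` and would not decode as an int, so the type in the indexer configuration has to match how the application wrote the cell.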

param

Defines a key-value pair. Its concrete use case was not clear to the author.

Configuration example

<!--
   Do row-based indexing on table "table1", never re-reading updated content.
   Store the unique document id in Solr field called "custom-id".
   Additionally store the row key in a Solr field called "custom-row", and store the 
   column family in a Solr field called "custom-family".

   Perform conversion of byte array keys using the class "com.mycompany.MyKeyFormatter".
--> 
<indexer
    table="table1"
    mapping-type="row"
    read-row="never"
    unique-key-field="custom-id"
    row-field="custom-row"
    column-family-field="custom-family"
    table-name-field="custom-table"
    unique-key-formatter="com.mycompany.MyKeyFormatter"
    >

  <!-- A float-based field taken from any qualifier in the column family "colfam" -->
  <field name="field1" value="colfam:*" source="qualifier" type="float"/>

  <param name="globalKeyA" value="globalValueA"/>
  <param name="globalKeyB" value="globalValueB"/>

</indexer>
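The `field1` entry in this example combines a wildcard expression with source="qualifier". The Python sketch below mimics, in a simplified way, which cells such an expression selects and what is taken from each. The function name `select` and the sample cells are made up, and qualifiers are modeled as plain strings rather than bytes.

```python
def select(cells, expression, source="value"):
    """Pick values or qualifiers from cells matching family:pattern.

    cells maps (family, qualifier) to the cell value.
    """
    family, _, pattern = expression.partition(":")
    picked = []
    for (fam, qual), value in sorted(cells.items()):
        if fam != family:
            continue
        if pattern.endswith("*"):
            # "my*" matches by prefix; a bare "*" matches every qualifier.
            matches = qual.startswith(pattern[:-1])
        else:
            matches = qual == pattern
        if matches:
            picked.append(qual if source == "qualifier" else value)
    return picked


cells = {
    ("colfam", "1.5"): b"a",
    ("colfam", "2.5"): b"b",
    ("other", "1.5"): b"c",
}
print(select(cells, "colfam:*", source="qualifier"))
print(select(cells, "colfam:1.5"))
```

With source="qualifier" the column names themselves become the indexed data, which suits wide tables where information is encoded in the qualifier.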

Bulk import

To index data that already exists in an HBase table, HBase Indexer provides a MapReduce tool for bulk import.

Direct import

sudo -u hdfs hadoop jar hbase-indexer-mr-1.5-cdh5.5.1-job.jar --hbase-indexer-zk hadoop-02:2181,hadoop-03:2181,hadoop-05:2181 --hbase-indexer-name index_test_content_k --reducers 0

Reference: http://www.niuchaoqun.com/14543825447680.html
