Importing Hive Data into HBase


Reference: https://segmentfault.com/a/1190000011616473

I. Hive batch job

1. Create the table

By default, the first field of the Hive table is used as the HBase rowkey.

2. Import the data

Insert userid into the key column, so that it becomes the rowkey of the HBase table.
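
A minimal sketch of these two steps, assuming the HBaseStorageHandler mapping that the second half of this post also uses; the table, column, and source names are illustrative:

-- 1. Create the Hive table; the first column (key) maps to the HBase rowkey
create table hbase_batch_table(
  key  string,
  name string,
  age  int
)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping" = ":key,cf:name,cf:age")
tblproperties ("hbase.table.name" = "hbase_batch_table");

-- 2. Import the data: userid goes into the key column and becomes the rowkey
insert overwrite table hbase_batch_table
select userid, name, age from source_db.user_info;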

II. Generate the intermediate HFiles

-Dimporttsv.bulk.output: the HFile output directory.
-Dimporttsv.columns: the column family and column names of the HBase table; note that their order must match the Hive table.
binlog_ns:hbase_hfile_load_table: the hbase_hfile_load_table table under the binlog_ns namespace.
hdfs://namespace1/apps/hive/warehouse/original_tmp_db.db/hbase_hfile_table: the data path of the Hive table original_tmp_db.hbase_hfile_table.

ImportTsv reads the files under the Hive table's data directory, analyzes the region distribution of the HBase table, generates an HFile for each region, and writes them to the -Dimporttsv.bulk.output directory.
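
Putting these parameters together, the invocation looks roughly like the sketch below; the separator and the cf:name,cf:age column list are illustrative and must match the Hive table's actual delimiter and layout:

# HBASE_ROW_KEY marks the field that becomes the rowkey; the HFiles are
# written under -Dimporttsv.bulk.output instead of being put into the table
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.separator=',' \
  -Dimporttsv.columns=HBASE_ROW_KEY,cf:name,cf:age \
  -Dimporttsv.bulk.output=hdfs://namespace1/tmp/hbase_hfile_out \
  binlog_ns:hbase_hfile_load_table \
  hdfs://namespace1/apps/hive/warehouse/original_tmp_db.db/hbase_hfile_table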

III. Load the HFiles into the HBase table with bulkload

Read the files under the HFile output directory and load them into the HBase table.
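
A sketch of this step, reusing the hypothetical output directory from the ImportTsv sketch above:

# move the generated HFiles into the regions of the target table
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
  hdfs://namespace1/tmp/hbase_hfile_out \
  binlog_ns:hbase_hfile_load_table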

Mapping Hive to HBase, and importing Hive table data into ClickHouse with Spark

HBase + Hive + Spark + ClickHouse

Create a table in HBase and map it to a Hive table, so that rows added on either side can be queried from the other.

Then use Spark to read the data from Hive, process it, and save it into ClickHouse.

I. HBase

1. HBase table operations

1.1 Create a namespace

hbase(main):008:0> create_namespace 'zxy',{'hbasename'=>'hadoop'}
0 row(s) in 0.0420 seconds

1.2 Create a table with a column family

hbase(main):012:0> create 'zxy:t1',{NAME=>'f1',VERSIONS=>5}
0 row(s) in 2.4850 seconds


hbase(main):014:0> list 'zxy:.*'
TABLE
zxy:t1
1 row(s) in 0.0200 seconds

=> ["zxy:t1"]
hbase(main):015:0> describe 'zxy:t1'
Table zxy:t1 is ENABLED
zxy:t1
COLUMN FAMILIES DESCRIPTION
{NAME => 'f1', BLOOMFILTER => 'ROW', VERSIONS => '5', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}

1.3 Insert data row by row

hbase(main):016:0> put 'zxy:t1','r1','f1:name','zxy'
0 row(s) in 0.1080 seconds
hbase(main):028:0> append 'zxy:t1','r1','f1:id','001'
0 row(s) in 0.0400 seconds

hbase(main):029:0> scan 'zxy:t1'
ROW                     COLUMN+CELL
 r1                     column=f1:id, timestamp=1627714724257, value=001
 r1                     column=f1:name, timestamp=1627714469210, value=zxy
1 row(s) in 0.0120 seconds

hbase(main):030:0> append 'zxy:t1','r2','f1:id','002'
0 row(s) in 0.0060 seconds

hbase(main):031:0> append 'zxy:t1','r2','f1:name','bx'
0 row(s) in 0.0080 seconds

hbase(main):032:0> append 'zxy:t1','r3','f1:id','003'
0 row(s) in 0.0040 seconds

hbase(main):033:0> append 'zxy:t1','r3','f1:name','zhg'
0 row(s) in 0.0040 seconds

hbase(main):034:0> scan 'zxy:t1'
ROW                     COLUMN+CELL
 r1                     column=f1:id, timestamp=1627714724257, value=001
 r1                     column=f1:name, timestamp=1627714469210, value=zxy
 r2                     column=f1:id, timestamp=1627714739647, value=002
 r2                     column=f1:name, timestamp=1627714754108, value=bx
 r3                     column=f1:id, timestamp=1627714768018, value=003
 r3                     column=f1:name, timestamp=1627714778121, value=zhg
3 row(s) in 0.0190 seconds

II. Hive

1. Create the Hive table mapped to HBase

hive (zxy)> create external table if not exists t1(
          > uid string,
          > id int,
          > name string
          > )
          > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
          > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f1:id,f1:name")
          > TBLPROPERTIES ("hbase.table.name" = "zxy:t1");
OK
Time taken: 0.306 seconds
hive (zxy)> select * from t1
          > ;
OK
r1      1       zxy
r2      2       bx
r3      3       zhg
Time taken: 0.438 seconds, Fetched: 3 row(s)

2. Add data in HBase

  • Add data on the HBase side

hbase(main):002:0> append 'zxy:t1','r4','f1:id','004'
0 row(s) in 0.1120 seconds

hbase(main):003:0> append 'zxy:t1','r4','f1:name','hyy'
0 row(s) in 0.0220 seconds

hbase(main):004:0> scan 'zxy:t1'
ROW                                      COLUMN+CELL
 r1                                      column=f1:id, timestamp=1627714724257, value=001
 r1                                      column=f1:name, timestamp=1627714469210, value=zxy
 r2                                      column=f1:id, timestamp=1627714739647, value=002
 r2                                      column=f1:name, timestamp=1627714754108, value=bx
 r3                                      column=f1:id, timestamp=1627714768018, value=003
 r3                                      column=f1:name, timestamp=1627714778121, value=zhg
 r4                                      column=f1:id, timestamp=1627716660482, value=004
 r4                                      column=f1:name, timestamp=1627716670546, value=hyy
  • The new rows are visible from Hive
hive (zxy)> select * from t1;
OK
r1      1       zxy
r2      2       bx
r3      3       zhg
r4      4       hyy

3. Add data in Hive

An HBase-backed Hive table cannot be loaded with LOAD DATA directly, so a staging table is used here to import the data.

  • user.txt
r5 5 tzk
r6 6 fyj
  • Create the staging table
hive (zxy)> create table if not exists t2 (uid string,id int,name string) row format delimited fields terminated by ' '
          > ;
OK
Time taken: 0.283 seconds
  • Load the data into the staging table
hive (zxy)> load data local inpath '/data/data/user.txt' into table t2;
Loading data to table zxy.t2
Table zxy.t2 stats: [numFiles=1, totalSize=18]
OK
  • Insert the staging-table rows into t1
hive (zxy)> insert into table t1 select * from t2;
Query ID = root_20210731154037_e8019cc0-38bb-42fc-9674-a9de2be9dba6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1627713883513_0001, Tracking URL = http://hadoop:8088/proxy/application_1627713883513_0001/
Kill Command = /data/apps/hadoop-2.8.1/bin/hadoop job  -kill job_1627713883513_0001
Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0
2021-07-31 15:41:23,373 Stage-0 map = 0%,  reduce = 0%
2021-07-31 15:41:34,585 Stage-0 map = 100%,  reduce = 0%, Cumulative CPU 3.45 sec
MapReduce Total cumulative CPU time: 3 seconds 450 msec
Ended Job = job_1627713883513_0001
MapReduce Jobs Launched:
Stage-Stage-0: Map: 1   Cumulative CPU: 3.45 sec   HDFS Read: 3659 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 450 msec
OK
Time taken: 60.406 seconds

  • Query the data from Hive
hive (zxy)> select * from t1;
OK
r1      1       zxy
r2      2       bx
r3      3       zhg
r4      4       hyy
r5      5       tzk
r6      6       fyj
Time taken: 0.335 seconds, Fetched: 6 row(s)
hive (zxy)>
  • The data is visible from HBase
hbase(main):001:0> scan 'zxy:t1'
ROW                                      COLUMN+CELL
 r1                                      column=f1:id, timestamp=1627714724257, value=001
 r1                                      column=f1:name, timestamp=1627714469210, value=zxy
 r2                                      column=f1:id, timestamp=1627714739647, value=002
 r2                                      column=f1:name, timestamp=1627714754108, value=bx
 r3                                      column=f1:id, timestamp=1627714768018, value=003
 r3                                      column=f1:name, timestamp=1627714778121, value=zhg
 r4                                      column=f1:id, timestamp=1627716660482, value=004
 r4                                      column=f1:name, timestamp=1627716670546, value=hyy
 r5                                      column=f1:id, timestamp=1627717294053, value=5
 r5                                      column=f1:name, timestamp=1627717294053, value=tzk
 r6                                      column=f1:id, timestamp=1627717294053, value=6
 r6                                      column=f1:name, timestamp=1627717294053, value=fyj
6 row(s) in 0.4660 seconds

III. Hive2ClickHouse

Full project link

1. pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.zxy</groupId>
    <artifactId>hive2ch</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
        <scala.version>2.11.12</scala.version>
        <play-json.version>2.3.9</play-json.version>
        <maven-scala-plugin.version>2.10.1</maven-scala-plugin.version>
        <scala-maven-plugin.version>3.2.0</scala-maven-plugin.version>
        <maven-assembly-plugin.version>2.6</maven-assembly-plugin.version>
        <spark.version>2.4.5</spark.version>
        <scope.type>compile</scope.type>
        <json.version>1.2.3</json.version>
        <!--compile provided-->
    </properties>

    <dependencies>

        <!--json 包-->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>${json.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
            <scope>${scope.type}</scope>
            <exclusions>
                <exclusion>
                    <groupId>com.google.guava</groupId>
                    <artifactId>guava</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>15.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
            <scope>${scope.type}</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <version>${spark.version}</version>
            <scope>${scope.type}</scope>
        </dependency>

        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.47</version>
        </dependency>

        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.17</version>
            <scope>${scope.type}</scope>
        </dependency>

        <dependency>
            <groupId>commons-codec</groupId>
            <artifactId>commons-codec</artifactId>
            <version>1.6</version>
        </dependency>

        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
            <scope>${scope.type}</scope>
        </dependency>

        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-reflect</artifactId>
            <version>${scala.version}</version>
            <scope>${scope.type}</scope>
        </dependency>

        <dependency>
            <groupId>com.github.scopt</groupId>
            <artifactId>scopt_2.11</artifactId>
            <version>4.0.0-RC2</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hudi</groupId>
            <artifactId>hudi-spark-bundle_2.11</artifactId>
            <version>0.5.2-incubating</version>
            <scope>${scope.type}</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-avro_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>com.hankcs</groupId>
            <artifactId>hanlp</artifactId>
            <version>portable-1.7.8</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.11</artifactId>
            <version>${spark.version}</version>
            <scope>${scope.type}</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-jdbc</artifactId>
            <version>1.2.1</version>
            <scope>${scope.type}</scope>
            <exclusions>
                <exclusion>
                    <groupId>javax.mail</groupId>
                    <artifactId>mail</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.eclipse.jetty.aggregate</groupId>
                    <artifactId>*</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>ru.yandex.clickhouse</groupId>
            <artifactId>clickhouse-jdbc</artifactId>
            <version>0.2.4</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>compile</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-hbase-handler</artifactId>
            <version>1.2.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>1.2.0</version>
        </dependency>

    </dependencies>

    <repositories>

        <repository>
            <id>alimaven</id>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
            <releases>
                <updatePolicy>never</updatePolicy>
            </releases>
            <snapshots>
                <updatePolicy>never</updatePolicy>
            </snapshots>
        </repository>
    </repositories>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins以上是关于Hive 数据导入 HBase的主要内容,如果未能解决你的问题,请参考以下文章

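2. Spark job

The job itself can be sketched as follows, assuming the zxy.t1 Hive table from above and a ClickHouse table created beforehand; the object name, JDBC URL, database, and credentials are illustrative:

import java.util.Properties

import org.apache.spark.sql.{SaveMode, SparkSession}

object Hive2ClickHouse {
  def main(args: Array[String]): Unit = {
    // Spark session with Hive support, so spark.sql can read the zxy database
    val spark = SparkSession.builder()
      .appName("Hive2ClickHouse")
      .enableHiveSupport()
      .getOrCreate()

    // read the Hive table that is mapped onto HBase (processing would go here)
    val df = spark.sql("select uid, id, name from zxy.t1")

    // JDBC properties for the clickhouse-jdbc 0.2.4 driver from the pom
    val props = new Properties()
    props.put("driver", "ru.yandex.clickhouse.ClickHouseDriver")
    props.put("user", "default")
    props.put("password", "")

    // append into a pre-created ClickHouse table (hypothetical URL and table)
    df.write
      .mode(SaveMode.Append)
      .jdbc("jdbc:clickhouse://hadoop:8123/zxy", "t1", props)

    spark.stop()
  }
}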