Importing Hive Data into HBase
Reference: https://segmentfault.com/a/1190000011616473
Part 1: The Hive batch job
1. Create the table
By default, the first column of the Hive table is used as the HBase rowkey.
2. Load the data
Insert userid as the first column (key) so that it becomes the rowkey of the HBase table.
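The referenced article's code for these two steps is not reproduced here. A minimal sketch, assuming a hypothetical source table user_info with columns userid, name, and age (all names illustrative), might look like:

```sql
-- Step 1: a plain Hive text table whose data files ImportTsv will read later.
-- Tab-delimited, so the files match ImportTsv's default separator.
CREATE TABLE original_tmp_db.hbase_hfile_table (
    key  string,  -- first column: will become the HBase rowkey
    name string,
    age  string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Step 2: populate it, selecting userid first so it ends up as the rowkey.
INSERT OVERWRITE TABLE original_tmp_db.hbase_hfile_table
SELECT userid AS key, name, age
FROM user_info;
```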
Part 2: Generate the intermediate HFiles
- -Dimporttsv.bulk.output: output directory for the generated HFiles
- -Dimporttsv.columns: the column family and column names of the HBase table; their order must match the Hive table's columns
- binlog_ns:hbase_hfile_load_table: the table hbase_hfile_load_table in the binlog_ns namespace
- hdfs://namespace1/apps/hive/warehouse/original_tmp_db.db/hbase_hfile_table: the HDFS data path of the Hive table original_tmp_db.hbase_hfile_table
ImportTsv reads the files in the Hive table's data directory, analyzes the region distribution of the target HBase table, generates one HFile per region, and places them under the -Dimporttsv.bulk.output directory.
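Put together, the invocation might look like the following sketch. The table name and Hive data path come from the flags above; the output directory and the cf:name/cf:age column mapping are assumptions, and the Hive table's files are assumed tab-delimited (ImportTsv's default separator):

```shell
# Generate HFiles only; nothing is written to HBase at this stage.
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.bulk.output=hdfs://namespace1/tmp/hbase_hfile_out \
  -Dimporttsv.columns=HBASE_ROW_KEY,cf:name,cf:age \
  binlog_ns:hbase_hfile_load_table \
  hdfs://namespace1/apps/hive/warehouse/original_tmp_db.db/hbase_hfile_table
```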
Part 3: Load the HFiles into the HBase table with bulkload
Read the files under the HFile output directory and load them into the HBase table.
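This is done with the LoadIncrementalHFiles tool; a sketch, assuming the same output directory and table as in Part 2:

```shell
# Moves the HFiles into the table's region directories, bypassing the
# normal write path, so the load completes quickly regardless of volume.
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
  hdfs://namespace1/tmp/hbase_hfile_out \
  binlog_ns:hbase_hfile_load_table
```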
Mapping Hive to HBase, and Importing Hive Data into ClickHouse with Spark
HBase + Hive + Spark + ClickHouse
Create a table in HBase and map it from Hive, so that rows added on either side are queryable from the other.
Then read the Hive table with Spark, process it, and save the result to ClickHouse.
Part 1: HBase
1 HBase table operations
1.1 Create a namespace
hbase(main):008:0> create_namespace 'zxy',{'hbasename'=>'hadoop'}
0 row(s) in 0.0420 seconds
1.2 Create a table with a column family
hbase(main):012:0> create 'zxy:t1',{NAME=>'f1',VERSIONS=>5}
0 row(s) in 2.4850 seconds
hbase(main):014:0> list 'zxy:.*'
TABLE
zxy:t1
1 row(s) in 0.0200 seconds
=> ["zxy:t1"]
hbase(main):015:0> describe 'zxy:t1'
Table zxy:t1 is ENABLED
zxy:t1
COLUMN FAMILIES DESCRIPTION
{NAME => 'f1', BLOOMFILTER => 'ROW', VERSIONS => '5', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1.3 Insert data row by row
hbase(main):016:0> put 'zxy:t1','r1','f1:name','zxy'
0 row(s) in 0.1080 seconds
hbase(main):028:0> append 'zxy:t1','r1','f1:id','001'
0 row(s) in 0.0400 seconds
hbase(main):029:0> scan 'zxy:t1'
ROW COLUMN+CELL
r1 column=f1:id, timestamp=1627714724257, value=001
r1 column=f1:name, timestamp=1627714469210, value=zxy
1 row(s) in 0.0120 seconds
hbase(main):030:0> append 'zxy:t1','r2','f1:id','002'
0 row(s) in 0.0060 seconds
hbase(main):031:0> append 'zxy:t1','r2','f1:name','bx'
0 row(s) in 0.0080 seconds
hbase(main):032:0> append 'zxy:t1','r3','f1:id','003'
0 row(s) in 0.0040 seconds
hbase(main):033:0> append 'zxy:t1','r3','f1:name','zhg'
0 row(s) in 0.0040 seconds
hbase(main):034:0> scan 'zxy:t1'
ROW COLUMN+CELL
r1 column=f1:id, timestamp=1627714724257, value=001
r1 column=f1:name, timestamp=1627714469210, value=zxy
r2 column=f1:id, timestamp=1627714739647, value=002
r2 column=f1:name, timestamp=1627714754108, value=bx
r3 column=f1:id, timestamp=1627714768018, value=003
r3 column=f1:name, timestamp=1627714778121, value=zhg
3 row(s) in 0.0190 seconds
Part 2: Hive
1 Create the Hive table mapped to HBase
hive (zxy)> create external table if not exists t1(
> uid string,
> id int,
> name string
> )
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f1:id,f1:name")
> TBLPROPERTIES ("hbase.table.name" = "zxy:t1");
OK
Time taken: 0.306 seconds
hive (zxy)> select * from t1
> ;
OK
r1 1 zxy
r2 2 bx
r3 3 zhg
Time taken: 0.438 seconds, Fetched: 3 row(s)
2 Add data on the HBase side
Add a row in the HBase shell:
hbase(main):002:0> append 'zxy:t1','r4','f1:id','004'
0 row(s) in 0.1120 seconds
hbase(main):003:0> append 'zxy:t1','r4','f1:name','hyy'
0 row(s) in 0.0220 seconds
hbase(main):004:0> scan 'zxy:t1'
ROW COLUMN+CELL
r1 column=f1:id, timestamp=1627714724257, value=001
r1 column=f1:name, timestamp=1627714469210, value=zxy
r2 column=f1:id, timestamp=1627714739647, value=002
r2 column=f1:name, timestamp=1627714754108, value=bx
r3 column=f1:id, timestamp=1627714768018, value=003
r3 column=f1:name, timestamp=1627714778121, value=zhg
r4 column=f1:id, timestamp=1627716660482, value=004
r4 column=f1:name, timestamp=1627716670546, value=hyy
The new row is visible from Hive:
hive (zxy)> select * from t1;
OK
r1 1 zxy
r2 2 bx
r3 3 zhg
r4 4 hyy
3 Add data on the Hive side
An HBase-backed Hive table cannot be populated directly with LOAD DATA, so a staging table is used to import the data.
- user.txt
r5 5 tzk
r6 6 fyj
- Create the staging table
hive (zxy)> create table if not exists t2 (uid string,id int,name string) row format delimited fields terminated by ' '
> ;
OK
Time taken: 0.283 seconds
- Load the data into the staging table
hive (zxy)> load data local inpath '/data/data/user.txt' into table t2;
Loading data to table zxy.t2
Table zxy.t2 stats: [numFiles=1, totalSize=18]
OK
- Insert the staging-table data into the mapped table
hive (zxy)> insert into table t1 select * from t2;
Query ID = root_20210731154037_e8019cc0-38bb-42fc-9674-a9de2be9dba6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1627713883513_0001, Tracking URL = http://hadoop:8088/proxy/application_1627713883513_0001/
Kill Command = /data/apps/hadoop-2.8.1/bin/hadoop job -kill job_1627713883513_0001
Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0
2021-07-31 15:41:23,373 Stage-0 map = 0%, reduce = 0%
2021-07-31 15:41:34,585 Stage-0 map = 100%, reduce = 0%, Cumulative CPU 3.45 sec
MapReduce Total cumulative CPU time: 3 seconds 450 msec
Ended Job = job_1627713883513_0001
MapReduce Jobs Launched:
Stage-Stage-0: Map: 1 Cumulative CPU: 3.45 sec HDFS Read: 3659 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 450 msec
OK
Time taken: 60.406 seconds
Query the data on the Hive side:
hive (zxy)> select * from t1;
OK
r1 1 zxy
r2 2 bx
r3 3 zhg
r4 4 hyy
r5 5 tzk
r6 6 fyj
Time taken: 0.335 seconds, Fetched: 6 row(s)
hive (zxy)>
The data is also visible on the HBase side:
hbase(main):001:0> scan 'zxy:t1'
ROW COLUMN+CELL
r1 column=f1:id, timestamp=1627714724257, value=001
r1 column=f1:name, timestamp=1627714469210, value=zxy
r2 column=f1:id, timestamp=1627714739647, value=002
r2 column=f1:name, timestamp=1627714754108, value=bx
r3 column=f1:id, timestamp=1627714768018, value=003
r3 column=f1:name, timestamp=1627714778121, value=zhg
r4 column=f1:id, timestamp=1627716660482, value=004
r4 column=f1:name, timestamp=1627716670546, value=hyy
r5 column=f1:id, timestamp=1627717294053, value=5
r5 column=f1:name, timestamp=1627717294053, value=tzk
r6 column=f1:id, timestamp=1627717294053, value=6
r6 column=f1:name, timestamp=1627717294053, value=fyj
6 row(s) in 0.4660 seconds
Part 3: Hive2ClickHouse
1 pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.zxy</groupId>
<artifactId>hive2ch</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<maven.compiler.source>8</maven.compiler.source>
<maven.compiler.target>8</maven.compiler.target>
<scala.version>2.11.12</scala.version>
<play-json.version>2.3.9</play-json.version>
<maven-scala-plugin.version>2.10.1</maven-scala-plugin.version>
<scala-maven-plugin.version>3.2.0</scala-maven-plugin.version>
<maven-assembly-plugin.version>2.6</maven-assembly-plugin.version>
<spark.version>2.4.5</spark.version>
<scope.type>compile</scope.type>
<json.version>1.2.3</json.version>
<!--compile provided-->
</properties>
<dependencies>
<!--json 包-->
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>${json.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
<scope>${scope.type}</scope>
<exclusions>
<exclusion>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>15.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
<scope>${scope.type}</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>${spark.version}</version>
<scope>${scope.type}</scope>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.47</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
<scope>${scope.type}</scope>
</dependency>
<dependency>
<groupId>commons-codec</groupId>
<artifactId>commons-codec</artifactId>
<version>1.6</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
<scope>${scope.type}</scope>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-reflect</artifactId>
<version>${scala.version}</version>
<scope>${scope.type}</scope>
</dependency>
<dependency>
<groupId>com.github.scopt</groupId>
<artifactId>scopt_2.11</artifactId>
<version>4.0.0-RC2</version>
</dependency>
<dependency>
<groupId>org.apache.hudi</groupId>
<artifactId>hudi-spark-bundle_2.11</artifactId>
<version>0.5.2-incubating</version>
<scope>${scope.type}</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-avro_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>com.hankcs</groupId>
<artifactId>hanlp</artifactId>
<version>portable-1.7.8</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>${spark.version}</version>
<scope>${scope.type}</scope>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-jdbc</artifactId>
<version>1.2.1</version>
<scope>${scope.type}</scope>
<exclusions>
<exclusion>
<groupId>javax.mail</groupId>
<artifactId>mail</artifactId>
</exclusion>
<exclusion>
<groupId>org.eclipse.jetty.aggregate</groupId>
<artifactId>*</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>ru.yandex.clickhouse</groupId>
<artifactId>clickhouse-jdbc</artifactId>
<version>0.2.4</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-hbase-handler</artifactId>
<version>1.2.1</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>1.2.0</version>
</dependency>
</dependencies>
<repositories>
<repository>
<id>alimaven</id>
<url>http://maven.aliyun.com/nexus/content/groups/public/</url>
<releases>
<updatePolicy>never</updatePolicy>
</releases>
<snapshots>
<updatePolicy>never</updatePolicy>
</snapshots>
</repository>
</repositories>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
</plugin>
</plugins>
</build>
</project>
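The Spark job itself is cut off in the source. With the spark-hive and clickhouse-jdbc dependencies declared in the pom above, a minimal sketch might look like the following (the ClickHouse host, database, and target table are assumptions, and the target table must already exist in ClickHouse with a matching schema):

```scala
import java.util.Properties
import org.apache.spark.sql.{SaveMode, SparkSession}

object Hive2CH {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive2ch")
      .enableHiveSupport() // requires spark-hive on the classpath
      .getOrCreate()

    // Read the HBase-backed Hive table defined earlier.
    val df = spark.sql("select uid, id, name from zxy.t1")

    val props = new Properties()
    props.setProperty("driver", "ru.yandex.clickhouse.ClickHouseDriver")

    // Append the rows to ClickHouse over JDBC (host/db/table assumed).
    df.write
      .mode(SaveMode.Append)
      .jdbc("jdbc:clickhouse://ch-host:8123/default", "t1", props)

    spark.stop()
  }
}
```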