大数据技术之Hive查询分区表和分桶表

Posted @从一到无穷大

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了大数据技术之Hive查询分区表和分桶表相关的知识,希望对你有一定的参考价值。


1 查询

查询语句语法:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
	FROM table_reference
	[WHERE where_condition]
	[GROUP BY col_list]
	[ORDER BY col_list]
	[CLUSTER BY col_list]
		| [DISTRIBUTE BY col_list] [SORT BY col_list]
	[LIMIT number]

1.1 基本查询(Select…From)

1.1.1 全表和特定列查询

1. 数据准备
(1)原始数据
dept:

10      ACCOUNTING      1700
20      RESEARCH        1800
30      SALES   1900
40      OPERATIONS      1700

emp:

369    SMITH   CLERK   7902    1980-12-17      800.00          20
7499    ALLEN   SALESMAN        7698    1981-2-20       1600.00 300.00  30
7521    WARD    SALESMAN        7698    1981-2-22       1250.00 500.00  30
7566    JONES   MANAGER 7839    1981-4-2        2975.00         20
7654    MARTIN  SALESMAN        7698    1981-9-28       1250.00 1400.00 30
7698    BLAKE   MANAGER 7839    1981-5-1        2850.00         30
7782    CLARK   MANAGER 7839    1981-6-9        2450.00         10
7788    SCOTT   ANALYST 7566    1987-4-19       3000.00         20
7839    KING    PRESIDENT               1981-11-17      5000.00         10
7844    TURNER  SALESMAN        7698    1981-9-8        1500.00 0.00    30
7876    ADAMS   CLERK   7788    1987-5-23       1100.00         20
7900    JAMES   CLERK   7698    1981-12-3       950.00          30
7902    FORD    ANALYST 7566    1981-12-3       3000.00         20
7934    MILLER  CLERK   7782    1982-1-23       1300.00         10
7534    MILLER  CLERK   7782    1982-10-23      1300.00         50

(2)创建部门表

create table if not exists dept(
deptno int,
dname string,
loc int
)
row format delimited fields terminated by '\\t';

(3)创建员工表

create table if not exists emp(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int)
row format delimited fields terminated by '\\t';

(4)导入数据

hive (default)> load data local inpath '/opt/module/hive-3.1.2/data/dept.txt' into table dept;
hive (default)> load data local inpath '/opt/module/hive-3.1.2/data/emp.txt' into table emp;

2. 全表查询

hive (default)> select * from emp;
OK
emp.empno	emp.ename	emp.job	emp.mgr	emp.hiredate	emp.sal	emp.comm	emp.deptno
7369	SMITH	CLERK	7902	1980-12-17	800.0	NULL	20
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.0	300.0	30
7521	WARD	SALESMAN	7698	1981-2-22	1250.0	500.0	30
7566	JONES	MANAGER	7839	1981-4-2	2975.0	NULL	20
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.0	1400.0	30
7698	BLAKE	MANAGER	7839	1981-5-1	2850.0	NULL	30
7782	CLARK	MANAGER	7839	1981-6-9	2450.0	NULL	10
7788	SCOTT	ANALYST	7566	1987-4-19	3000.0	NULL	20
7839	KING	PRESIDENT	NULL	1981-11-17	5000.0	NULL	10
7844	TURNER	SALESMAN	7698	1981-9-8	1500.0	0.0	30
7876	ADAMS	CLERK	7788	1987-5-23	1100.0	NULL	20
7900	JAMES	CLERK	7698	1981-12-3	950.0	NULL	30
7902	FORD	ANALYST	7566	1981-12-3	3000.0	NULL	20
7934	MILLER	CLERK	7782	1982-1-23	1300.0	NULL	10
7534	MILLER	CLERK	7782	1982-10-23	1300.0	NULL	50
Time taken: 0.271 seconds, Fetched: 15 row(s)
hive (default)> select empno, ename, job, mgr, hiredate, sal, comm, deptno from emp;
OK
empno	ename	job	mgr	hiredate	sal	comm	deptno
7369	SMITH	CLERK	7902	1980-12-17	800.0	NULL	20
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.0	300.0	30
7521	WARD	SALESMAN	7698	1981-2-22	1250.0	500.0	30
7566	JONES	MANAGER	7839	1981-4-2	2975.0	NULL	20
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.0	1400.0	30
7698	BLAKE	MANAGER	7839	1981-5-1	2850.0	NULL	30
7782	CLARK	MANAGER	7839	1981-6-9	2450.0	NULL	10
7788	SCOTT	ANALYST	7566	1987-4-19	3000.0	NULL	20
7839	KING	PRESIDENT	NULL	1981-11-17	5000.0	NULL	10
7844	TURNER	SALESMAN	7698	1981-9-8	1500.0	0.0	30
7876	ADAMS	CLERK	7788	1987-5-23	1100.0	NULL	20
7900	JAMES	CLERK	7698	1981-12-3	950.0	NULL	30
7902	FORD	ANALYST	7566	1981-12-3	3000.0	NULL	20
7934	MILLER	CLERK	7782	1982-1-23	1300.0	NULL	10
7534	MILLER	CLERK	7782	1982-10-23	1300.0	NULL	50
Time taken: 0.323 seconds, Fetched: 15 row(s)

3. 选择特定列查询

hive (default)> select empno, ename from emp;
OK
empno	ename
7369	SMITH
7499	ALLEN
7521	WARD
7566	JONES
7654	MARTIN
7698	BLAKE
7782	CLARK
7788	SCOTT
7839	KING
7844	TURNER
7876	ADAMS
7900	JAMES
7902	FORD
7934	MILLER
7534	MILLER
Time taken: 0.279 seconds, Fetched: 15 row(s)

注意:
(1)SQL 语言 大小写不敏感。
(2)SQL 可以写在一行或者多行
(3)关键字不能被缩写也不能分行
(4)各子句一般要分行写。
(5)使用缩进提高语句的可读性。

1.1.2 列别名

重命名一个列便于计算。命名方法为紧跟列名,也可以在列名和别名之间加入关键字AS
实例:查询名称和部门

hive (default)> select ename as name, deptno dn from emp;
OK
name	dn
SMITH	20
ALLEN	30
WARD	30
JONES	20
MARTIN	30
BLAKE	30
CLARK	10
SCOTT	20
KING	10
TURNER	30
ADAMS	20
JAMES	30
FORD	20
MILLER	10
MILLER	50
Time taken: 0.289 seconds, Fetched: 15 row(s)

1.1.3 算术运算符


案例实操:查询出所有员工的薪水后加 1 显示。

hive (default)> select sal+1 from emp;
OK
_c0
801.0
1601.0
1251.0
2976.0
1251.0
2851.0
2451.0
3001.0
5001.0
1501.0
1101.0
951.0
3001.0
1301.0
1301.0
Time taken: 0.79 seconds, Fetched: 15 row(s)

1.1.4 常用函数

(1)求总行数(count)

hive (default)> select count(*) cnt from emp;
OK
cnt
15
Time taken: 20.637 seconds, Fetched: 1 row(s)

(2)求工资的最大值(max)

hive (default)> select max(sal) max_sal from emp;
OK
max_sal
5000.0
Time taken: 19.305 seconds, Fetched: 1 row(s)

(3)求工资的最小值(min)

hive (default)> select min(sal) min_sal from emp;
OK
min_sal
800.0
Time taken: 31.402 seconds, Fetched: 1 row(s)

(4)求工资的总和(sum)

hive (default)> select sum(sal) sum_sal from emp;
OK
sum_sal
30325.0
Time taken: 8.185 seconds, Fetched: 1 row(s)

(5)求工资的平均值(avg)

hive (default)> select avg(sal) avg_sal from emp;
OK
avg_sal
2021.6666666666667
Time taken: 8.706 seconds, Fetched: 1 row(s)

1.1.5 Limit 语句

典型的查询会返回多行数据。LIMIT 子句用于限制返回的行数。

hive (default)> select * from emp limit 3;
OK
emp.empno	emp.ename	emp.job	emp.mgr	emp.hiredate	emp.sal	emp.comm	emp.deptno
7369	SMITH	CLERK	7902	1980-12-17	800.0	NULL	20
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.0	300.0	30
7521	WARD	SALESMAN	7698	1981-2-22	1250.0	500.0	30
Time taken: 0.318 seconds, Fetched: 3 row(s)
hive (default)> select sal from emp limit 5;
OK
sal
800.0
1600.0
1250.0
2975.0
1250.0
Time taken: 0.298 seconds, Fetched: 5 row(s)

1.1.6 Where 语句

使用 WHERE 子句,将不满足条件的行过滤掉。
WHERE 子句紧随 FROM 子句。WHERE 子句中不能使用字段别名。
例:查询出薪水大于2000 的所有员工:

hive (default)> select * from emp where sal > 2000;
OK
emp.empno	emp.ename	emp.job	emp.mgr	emp.hiredate	emp.sal	emp.comm	emp.deptno
7566	JONES	MANAGER	7839	1981-4-2	2975.0	NULL	20
7698	BLAKE	MANAGER	7839	1981-5-1	2850.0	NULL	30
7782	CLARK	MANAGER	7839	1981-6-9	2450.0	NULL	10
7788	SCOTT	ANALYST	7566	1987-4-19	3000.0	NULL	20
7839	KING	PRESIDENT	NULL	1981-11-17	5000.0	NULL	10
7902	FORD	ANALYST	7566	1981-12-3	3000.0	NULL	20
Time taken: 0.324 seconds, Fetched: 6 row(s)

1.1.7 比较运算符(Between / In / Is Null)

下面表中描述了谓词操作符,这些操作符同样可以用于 JOIN…ON和 HAVING语句中。



例:(1)查询出薪水等于 5000 的所有员工

hive (default)> select * from emp where sal=5000;
OK
emp.empno	emp.ename	emp.job	emp.mgr	emp.hiredate	emp.sal	emp.comm	emp.deptno
7839	KING	PRESIDENT	NULL	1981-11-17	5000.0	NULL	10
Time taken: 0.312 seconds, Fetched: 1 row(s)

(2)查询工资在 800到 950 的员工信息

hive (default)> select * from emp where sal between 800 and 950;
OK
emp.empno	emp.ename	emp.job	emp.mgr	emp.hiredate	emp.sal	emp.comm	emp.deptno
7369	SMITH	CLERK	7902	1980-12-17	800.0	NULL	20
7900	JAMES	CLERK	7698	1981-12-3	950.0	NULL	30
Time taken: 0.267 seconds, Fetched: 2 row(s)

(3)查询 comm 为空的所有员工信息

hive (default)> select * from emp where comm is null;
OK
emp.empno	emp.ename	emp.job	emp.mgr	emp.hiredate	emp.sal	emp.comm	emp.deptno
7369	SMITH	CLERK	7902	1980-12-17	800.0	NULL	20
7566	JONES	MANAGER	7839	1981-4-2	2975.0	NULL	20
7698	BLAKE	MANAGER	7839	1981-5-1	2850.0	NULL	30
7782	CLARK	MANAGER	7839	1981-6-9	2450.0	NULL	10
7788	SCOTT	ANALYST	7566	1987-4-19	3000.0	NULL	20
7839	KING	PRESIDENT	NULL	1981-11-17	5000.0	NULL	10
7876	ADAMS	CLERK	7788	1987-5-23	1100.0	NULL	20
7900	JAMES	CLERK	7698	1981-12-3	950.0	NULL	30
7902	FORD	ANALYST	7566	1981-12-3	3000.0	NULL	20
7934	MILLER	CLERK	7782	1982-1-23	1300.0	NULL	10
7534	MILLER	CLERK	7782	1982-10-23	1300.0	NULL	50
Time taken: 0.283 seconds, Fetched: 11 row(s)

(4)查询工资是 1500 或 5000 的员工信息

hive (default)> select * from emp where sal in (1500, 5000);
OK
emp.empno	emp.ename	emp.job	emp.mgr	emp.hiredate	emp.sal	emp.comm	emp.deptno
7839	KING	PRESIDENT	NULL	1981-11-17	5000.0	NULL	10
7844	TURNER	SALESMAN	7698	1981-9-8	1500.0	0.0	30
Time taken: 0.299 seconds, Fetched: 2 row(s)

1.1.8 Like和RLike

(1)使用 LIKE 运算选择类似的值。

(2)选择条件可以包含字符或数字。
% 代表零个或多个字符任意个字符 。
_ 代表一个字符。

(3)RLIKE 子句 是 Hive 中这个功能的一个扩展,其可以通过 Java 的正则表达式这个更强大的语言来指定匹配条件。

案例实操
查找名字以 A 开头的员工信息

hive (default)> select * from emp where ename like 'A%';
OK
emp.empno	emp.ename	emp.job	emp.mgr	emp.hiredate	emp.sal	emp.comm	emp.deptno
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.0	300.0	30
7876	ADAMS	CLERK	7788	1987-5-23	1100.0	NULL	20

查找名字中第二个字母为 A 的员工信息

hive (default)> select * from emp where ename like '_A%';
OK
emp.empno	emp.ename	emp.job	emp.mgr	emp.hiredate	emp.sal	emp.comm	emp.deptno
7521	WARD	SALESMAN	7698	1981-2-22	1250.0	500.0	30
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.0	1400.0	30
7900	JAMES	CLERK	7698	1981-12-3	950.0	NULL	30
Time taken: 0.301 seconds, Fetched: 3 row(s)

查找名字中带有 A 的员工信息

hive (default)> select * from emp where ename rlike '[A]';
OK
emp.empno	emp.ename	emp.job	emp.mgr	emp.hiredate	emp.sal	emp.comm	emp.deptno
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.0	300.0	30
7521	WARD	SALESMAN	7698	1981-2-22	1250.0	500.0	30
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.0	1400.0	30
7698	BLAKE	MANAGER	7839	1981-5-1	2850.0	NULL	30
7782	CLARK	MANAGER	7839	1981-6-9	2450.0	NULL	10
7876	ADAMS	CLERK	7788	1987-<

以上是关于大数据技术之Hive查询分区表和分桶表的主要内容,如果未能解决你的问题,请参考以下文章

入门大数据---Hive分区表和分桶表

大数据之hive:hive分桶表

Hive:第 7 章 分区表和分桶表

hive的建表,及分区表和分桶表的基本操作

分区表和分桶表

分区和分桶区别