A Brief Overview of How Common Hive Operators Are Implemented -- MapReduce Edition

Common Hive operators include distinct, join, group by, order by, distribute by, sort by, count, and so on. These operators are convenient to use in SQL and quickly give us the results we want, but how are they actually implemented under the hood?

order by is easy to reason about: a single reducer sorts all records by value. Because of this, order by takes a very long time on large datasets and easily runs out of memory, so it is generally avoided unless the business genuinely requires it. distribute by is also fairly obvious: rows are routed to different reducers according to the hash of the distribute-by value. sort by is a scaled-down order by: it only sorts the rows within its own reducer, producing locally ordered output. sort by tastes best in combination with distribute by, and when both use the same columns the pair can be abbreviated as cluster by. count is clearer still: accumulating the values of identical keys in the combiner or reducer yields the result.
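
To make those last few concrete, here is a minimal Python simulation of distribute by plus sort by (illustrative only, not Hive's code; the rows and NUM_REDUCERS are made up, and Python's hash stands in for Hive's partitioning hash):

from collections import defaultdict

rows = ["banana", "apple", "cherry", "date", "fig", "grape"]
NUM_REDUCERS = 2

# distribute by: route each row to a reducer by the hash of its value
partitions = defaultdict(list)
for row in rows:
    partitions[hash(row) % NUM_REDUCERS].append(row)

# sort by: each reducer sorts only its own rows, giving local order
for reducer_id in partitions:
    partitions[reducer_id].sort()

# Equal values share a reducer and every partition is sorted (the
# combined effect of cluster by), but there is no global order.
print(dict(partitions))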

The trickier ones are distinct, join, and group by. This article focuses on the rough implementation principles of these three operators on the MapReduce engine. Consider it an amateur's attempt, offered in the hope of drawing out better explanations. Let's start with group by.

In the map phase, the combination of the group by fields forms the key (a single group by field means a single-field key). The field that will be aggregated after the group by becomes the value: for a count, the value is 1; for a sum over another field, the value is that field itself.

In the shuffle phase, records are distributed to different reducers according to their keys. Note that data skew can arise here if the keys are unevenly distributed.

In the reduce phase, the values of identical keys are accumulated (or aggregated however else is required) to produce the result.

This article explains the group by process quite clearly, with vivid illustrations: http://www.mamicode.com/info-detail-2292193.html

The figure below gives an example, corresponding to the statement select rank, isonline, count(*) from city group by rank, isonline; a minimal simulation of the three phases is sketched after it.
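
Here is that minimal Python simulation (illustrative only, not Hive's actual implementation; the toy city rows and NUM_REDUCERS are invented):

from collections import defaultdict

# Toy rows of the city table: (rank, isonline)
city = [("A", 1), ("A", 1), ("A", 0), ("B", 1), ("B", 1)]

# Map phase: key = the group by fields, value = 1 for count(*)
map_output = [((rank, isonline), 1) for rank, isonline in city]

# Shuffle phase: route each pair to a reducer by the hash of its key
NUM_REDUCERS = 2
partitions = defaultdict(list)
for key, value in map_output:
    partitions[hash(key) % NUM_REDUCERS].append((key, value))

# Reduce phase: accumulate the values of identical keys
for pairs in partitions.values():
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    for (rank, isonline), cnt in counts.items():
        print(rank, isonline, cnt)   # one output row per group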

If group by hits data skew, then besides the general remedies (replacing keys with random numbers, pulling the largest keys out ahead of time, and so on), there are a few methods specific to group by:

(1) set hive.map.aggr=true, which turns on the map-side combiner and reduces the amount of data sent to the reducers; the companion parameter hive.groupby.mapaggr.checkinterval specifies the number of entries aggregated on the map side.

(2) Set mapred.reduce.tasks to a larger value, lowering the amount of data each reducer has to process.

(3) set hive.groupby.skewindata=true, which load-balances automatically. The generated query plan contains two MR jobs. In the first job, the map output is distributed randomly across the reducers and each reducer performs a partial aggregation and emits its results; since rows with the same group by key may be sent to different reducers, the load gets balanced. The second MR job then distributes the partially aggregated results by group by key (this time guaranteeing that the same group by key goes to the same reducer) and completes the final aggregation. A sketch of this two-stage idea follows the list.
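
Here is a minimal Python simulation of the two-stage idea (illustrative only; the real first job scatters rows to random reducers, while this sketch mimics that by attaching an explicit random salt to each key):

import random
from collections import defaultdict

# Skewed map output: one hot key dominates
pairs = [("hot_key", 1)] * 1000 + [("rare_key", 1)] * 3

# Job 1: scatter each key across SALTS buckets at random,
# then partially aggregate per (key, salt)
SALTS = 4
stage1 = defaultdict(int)
for key, value in pairs:
    salted = (key, random.randrange(SALTS))
    stage1[salted] += value

# Job 2: drop the salt and combine the partial results by real key,
# which is safe because a count decomposes into partial counts
stage2 = defaultdict(int)
for (key, _salt), partial in stage1.items():
    stage2[key] += partial

print(dict(stage2))   # {'hot_key': 1000, 'rare_key': 3}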

Hive has two join strategies: map join and common join.

If a map-side join is not explicitly requested and the conditions for automatic map join conversion are not met, Hive performs a reduce-side join, the common join, which consists of three steps: map, shuffle, and reduce.

(1) Map phase

The source tables are read, and the map output key is the column in the join on condition; if the join has multiple join keys, their combination forms the key. The map output value consists of the columns needed downstream (those referenced in the select or where), plus a Tag marking which table the value came from. The output is then sorted by key.

(2) Shuffle phase

Keys are hashed, and each key/value pair is pushed to a reducer according to its hash value, which ensures that identical keys from the two tables land in the same reducer.

(3) Reduce phase

The join is completed for each key, using the Tag to tell apart rows coming from different tables.

Taking the SQL below as an example, the figure that follows roughly illustrates the join process.

SELECT u.name, o.orderid FROM user u JOIN order o ON u.uid = o.uid;

The join field is uid, so uid becomes the map output key, and the value holds the selected field name together with a tag marking the source table. The shuffle phase brings key/value pairs with the same key together, and the reduce phase stitches together records from different source tables that share a key value; one-to-many matches are possible. A minimal simulation of this tagged reduce-side join is sketched below.
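
A minimal Python simulation of the tagged reduce-side join (illustrative only; the uid/name/orderid rows are invented):

from collections import defaultdict

# Map output: (join key, (table tag, payload column))
user_rows = [(1, "alice"), (2, "bob")]                    # (uid, name)
order_rows = [(1, "o-100"), (1, "o-101"), (2, "o-200")]   # (uid, orderid)
map_output = [(uid, ("u", name)) for uid, name in user_rows]
map_output += [(uid, ("o", oid)) for uid, oid in order_rows]

# Shuffle: bring values with the same key together
groups = defaultdict(list)
for key, tagged in map_output:
    groups[key].append(tagged)

# Reduce: per key, separate the two tables by tag and emit every pairing
for uid, tagged_values in groups.items():
    names = [v for tag, v in tagged_values if tag == "u"]
    orderids = [v for tag, v in tagged_values if tag == "o"]
    for name in names:            # one-to-many keys yield multiple rows
        for oid in orderids:
            print(name, oid)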

If a map join is explicitly requested, or one of the joined tables is smaller than a certain size (25MB by default), the join is executed as a map join. Exactly how small the small table must be is controlled by the parameter hive.mapjoin.smalltable.filesize.

Before Hive 0.7, the hint /*+ mapjoin(table) */ had to be given for a MapJoin to run, otherwise a Common Join was executed; from version 0.7 onward, Hive converts to a Map Join automatically, controlled by the parameter hive.auto.convert.join, which defaults to true.

The following figure, taken from http://lxw1234.com/archives/2015/06/313.htm (the blogger is a deeply knowledgeable and generous veteran, and the image watermark also carries his URL), illustrates how a map join executes.

First, a Local Task (a task executed locally on the client), Task A, scans the data of the small table b, converts it into a HashTable data structure, writes it to a local file, and then loads that file into the DistributedCache.

Next comes Task B, an MR job without a Reduce stage: it starts MapTasks to scan the big table a, and during the map phase each record of a is looked up against table b's HashTable in the DistributedCache, with the join results output directly.

Since a MapJoin has no Reduce stage, the Maps write the result files directly; there are as many result files as there are Map Tasks. A minimal simulation of this broadcast-hash-table idea is sketched below.
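
A minimal Python simulation of the map join (illustrative only; the rows are invented, and a plain dict stands in for the serialized HashTable shipped through the DistributedCache):

# Local task: build a hash table from the small table b
small_table = [(1, "o-100"), (1, "o-101"), (2, "o-200")]   # (uid, orderid)
hash_table = {}
for uid, oid in small_table:
    hash_table.setdefault(uid, []).append(oid)

# Map-only job: stream the big table a and probe the hash table per record
big_table = [(1, "alice"), (2, "bob"), (3, "carol")]       # (uid, name)
for uid, name in big_table:
    for oid in hash_table.get(uid, []):   # inner join: non-matches are skipped
        print(name, oid)                  # emitted directly, no reduce stage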

distinct generally appears together with group by.

When distinct is applied to one field, the group by fields and the distinct field are combined to form the map output key, the value is set to 1, and the group by fields alone serve as the partition key. This guarantees that records with the same group by fields all go to the same reducer, and the map output naturally arrives sorted by the composite key. Once records have been routed to the reduce side by partition key, the distinct field can be read off the composite keys in order, so it too is sorted; walking through the distinct values and incrementing a counter each time a new value appears produces the count distinct result. For example, for the SQL statement below, the process can be illustrated by the figure that follows, and a minimal simulation is sketched after it.
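
A minimal Python simulation of that single-field count distinct flow (illustrative only; the rows are invented, and an explicit sort stands in for MR's sort phase):

from collections import defaultdict

# Toy rows: (group field, distinct field), as in
# select g, count(distinct f) from t group by g
rows = [("g1", "x"), ("g1", "x"), ("g1", "y"), ("g2", "x")]

# Map: composite key (group field, distinct field), value 1;
# partition by the group field only
NUM_REDUCERS = 2
partitions = defaultdict(list)
for g, f in rows:
    partitions[hash(g) % NUM_REDUCERS].append((g, f))

# Reduce: keys arrive sorted, so equal distinct values are adjacent;
# bump the counter only when a new (group, value) pair appears
for pairs in partitions.values():
    pairs.sort()
    counts = defaultdict(int)
    prev = None
    for composite in pairs:
        if composite != prev:
            counts[composite[0]] += 1
        prev = composite
    for g, cnt in counts.items():
        print(g, cnt)   # count(distinct f) per group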

I have not yet managed to understand how this part is implemented, and other people's write-ups have not made it clear to me either. Would some kind and erudite expert be willing to offer a pointer?

Hive Basic Exercises, Part 1

Below are some basic Hive exercises; more will be added over time.

A brief description of how Hive works

Hive is a tool built on top of Hadoop for managing data stored on HDFS. In essence it executes MapReduce programs; it merely makes development easier by offering SQL-like statements. The Hive driver translates these SQL-like statements into MapReduce tasks to execute, which is why it runs relatively slowly.

At the core of Hive is the driver, which connects SQL with HDFS by turning SQL into MapReduce jobs. The driver consists mainly of:

(1) Parser: parses the SQL statement and divides it into stages

(2) Compiler: compiles the stages into individual MR tasks

(3) Optimizer: optimizes the logical execution plan

(4) Executor: converts the logical tasks in the SQL into physical tasks on HDFS; for Hive this executor is MapReduce

Differences between Hive internal and external tables

Internal tables are mainly used in the data warehouse (DW) layer and are held exclusively by their owner; if the table is dropped, the underlying raw data is deleted as well, but other users are unaffected.

External tables are mainly used in the source data (ODS) layer; dropping the table does not delete the underlying data.

When creating a table, an internal table does not need the external keyword, while an external table does.

Exercise 1: create a table and import data

战狼2,吴京:吴刚:卢婧姗,2017-08-16
大话西游,周星驰:吴孟达,1995-09-01
哪吒,吕艳婷:瀚墨,2019-07-26
使徒行者2,张家辉:古天乐:吴镇宇,2019-08-07
鼠胆英雄,岳云鹏:佟丽娅:田雨:袁弘,2019-08-02

Create the table and import the data.

# Create the table (the first two attempts fail: one omits "fields", the second misspells "terminated")
0: jdbc:hive2://node01:10000> create table movie_info(moviename string,actors array<string>,showtime string) row format delimited by ',' collection items teminated by ':';
Error: Error while compiling statement: FAILED: ParseException line 1:100 cannot recognize input near 'by' '','' 'collection' in serde properties specification (state=42000,code=40000)
0: jdbc:hive2://node01:10000> create table movie_info(moviename string,actors array<string>,showtime string) row format delimited fields terminated by ',' collection items teminated by ':';
Error: Error while compiling statement: FAILED: ParseException line 1:142 mismatched input 'teminated' expecting TERMINATED near 'items' in table row format's column separator (state=42000,code=40000)
0: jdbc:hive2://node01:10000> create table movie_info(moviename string,actors array<string>,showtime string) row format delimited fields terminated by ',' collection items terminated by ':';
INFO  : Compiling command(queryId=hadoop_20191115220202_01acb251-d8e2-46c5-bf20-5b354d6d2923): create table movie_info(moviename string,actors array<string>,showtime string) row format delimited fields terminated by ',' collection items terminated by ':'
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=hadoop_20191115220202_01acb251-d8e2-46c5-bf20-5b354d6d2923); Time taken: 0.035 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hadoop_20191115220202_01acb251-d8e2-46c5-bf20-5b354d6d2923): create table movie_info(moviename string,actors array<string>,showtime string) row format delimited fields terminated by ',' collection items terminated by ':'
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hadoop_20191115220202_01acb251-d8e2-46c5-bf20-5b354d6d2923); Time taken: 0.286 seconds
INFO  : OK
No rows affected (0.359 seconds)
# Load the data
0: jdbc:hive2://node01:10000> load data local inpath '/kkb/install/hivedatas/move_info.txt' into table movie_info;
INFO  : Compiling command(queryId=hadoop_20191115220505_1c2ea52f-37cf-43d2-85b0-e2b54049aa33): load data local inpath '/kkb/install/hivedatas/move_info.txt' into table movie_info
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=hadoop_20191115220505_1c2ea52f-37cf-43d2-85b0-e2b54049aa33); Time taken: 0.061 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hadoop_20191115220505_1c2ea52f-37cf-43d2-85b0-e2b54049aa33): load data local inpath '/kkb/install/hivedatas/move_info.txt' into table movie_info
INFO  : Starting task [Stage-0:MOVE] in serial mode
INFO  : Loading data to table db_hive.movie_info from file:/kkb/install/hivedatas/move_info.txt
INFO  : Starting task [Stage-1:STATS] in serial mode
INFO  : Table db_hive.movie_info stats: [numFiles=1, totalSize=235]
INFO  : Completed executing command(queryId=hadoop_20191115220505_1c2ea52f-37cf-43d2-85b0-e2b54049aa33); Time taken: 0.645 seconds
INFO  : OK
No rows affected (0.726 seconds)
# Query the result after loading
0: jdbc:hive2://node01:10000> select * from movie_info;
INFO  : Compiling command(queryId=hadoop_20191115220505_0e4cbd35-3f1c-44d8-8a03-2b7dcd7ab03e): select * from movie_info
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:movie_info.moviename, type:string, comment:null), FieldSchema(name:movie_info.actors, type:array<string>, comment:null), FieldSchema(name:movie_info.showtime, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hadoop_20191115220505_0e4cbd35-3f1c-44d8-8a03-2b7dcd7ab03e); Time taken: 0.09 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hadoop_20191115220505_0e4cbd35-3f1c-44d8-8a03-2b7dcd7ab03e): select * from movie_info
INFO  : Completed executing command(queryId=hadoop_20191115220505_0e4cbd35-3f1c-44d8-8a03-2b7dcd7ab03e); Time taken: 0.0 seconds
INFO  : OK
+-----------------------+--------------------------+----------------------+--+
| movie_info.moviename  |    movie_info.actors     | movie_info.showtime  |
+-----------------------+--------------------------+----------------------+--+
| 战狼2                   | ["吴京","吴刚","卢婧姗"]        | 2017-08-16           |
| 大话西游                  | ["周星驰","吴孟达"]            | 1995-09-01           |
| 哪吒                    | ["吕艳婷","瀚墨"]             | 2019-07-26           |
| 使徒行者2                 | ["张家辉","古天乐","吴镇宇"]      | 2019-08-07           |
| 鼠胆英雄                  | ["岳云鹏","佟丽娅","田雨","袁弘"]  | 2019-08-02           |
+-----------------------+--------------------------+----------------------+--+
5 rows selected (0.177 seconds)

3.1 Find the second lead actor of each movie

0: jdbc:hive2://node01:10000> select moviename,actors[1] from movie_info;
INFO  : Compiling command(queryId=hadoop_20191115220909_702e25ac-bbf9-4fb3-a39e-44156379fcf3): select moviename,actors[1] from movie_info
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:moviename, type:string, comment:null), FieldSchema(name:_c1, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hadoop_20191115220909_702e25ac-bbf9-4fb3-a39e-44156379fcf3); Time taken: 0.134 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hadoop_20191115220909_702e25ac-bbf9-4fb3-a39e-44156379fcf3): select moviename,actors[1] from movie_info
INFO  : Completed executing command(queryId=hadoop_20191115220909_702e25ac-bbf9-4fb3-a39e-44156379fcf3); Time taken: 0.0 seconds
INFO  : OK
+------------+------+--+
| moviename  | _c1  |
+------------+------+--+
| 战狼2        | 吴刚   |
| 大话西游       | 吴孟达  |
| 哪吒         | 瀚墨   |
| 使徒行者2      | 古天乐  |
| 鼠胆英雄       | 佟丽娅  |
+------------+------+--+

3.2 Count how many lead actors each movie has

0: jdbc:hive2://node01:10000> select moviename,size(actors) as actorcount from movie_info;
INFO  : Compiling command(queryId=hadoop_20191115221010_1b46f1ce-12bf-406d-9cfc-395df7f26816): select moviename,size(actors) as actorcount from movie_info
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:moviename, type:string, comment:null), FieldSchema(name:actorcount, type:int, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hadoop_20191115221010_1b46f1ce-12bf-406d-9cfc-395df7f26816); Time taken: 0.075 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hadoop_20191115221010_1b46f1ce-12bf-406d-9cfc-395df7f26816): select moviename,size(actors) as actorcount from movie_info
INFO  : Completed executing command(queryId=hadoop_20191115221010_1b46f1ce-12bf-406d-9cfc-395df7f26816); Time taken: 0.001 seconds
INFO  : OK
+------------+-------------+--+
| moviename  | actorcount  |
+------------+-------------+--+
| 战狼2        | 3           |
| 大话西游       | 2           |
| 哪吒         | 2           |
| 使徒行者2      | 3           |
| 鼠胆英雄       | 4           |
+------------+-------------+--+

3.3 Movies whose lead actors include 古天乐

# lateral view is needed here
0: jdbc:hive2://node01:10000> Select t.moviename,t.actor from
. . . . . . . . . . . . . . > (
. . . . . . . . . . . . . . > select moviename,actor from movie_info lateral view explode(actors)temp as actor
. . . . . . . . . . . . . . > ) t
. . . . . . . . . . . . . . > Where t.actor='古天乐';
INFO  : Compiling command(queryId=hadoop_20191115222525_532eeeb8-06db-4828-9e6d-73594e31877e): Select t.moviename,t.actor from
(
select moviename,actor from movie_info lateral view explode(actors)temp as actor
) t
Where t.actor='古天乐'
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:t.moviename, type:string, comment:null), FieldSchema(name:t.actor, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hadoop_20191115222525_532eeeb8-06db-4828-9e6d-73594e31877e); Time taken: 0.225 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hadoop_20191115222525_532eeeb8-06db-4828-9e6d-73594e31877e): Select t.moviename,t.actor from
(
select moviename,actor from movie_info lateral view explode(actors)temp as actor
) t
Where t.actor='古天乐'
INFO  : Completed executing command(queryId=hadoop_20191115222525_532eeeb8-06db-4828-9e6d-73594e31877e); Time taken: 0.0 seconds
INFO  : OK
+--------------+----------+--+
| t.moviename  | t.actor  |
+--------------+----------+--+
| 使徒行者2        | 古天乐      |
+--------------+----------+--+
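
Conceptually, lateral view explode turns each array element into a row of its own before the where filter runs. A minimal Python analogue (not Hive code; the data mirrors two rows of movie_info):

movies = [("战狼2", ["吴京", "吴刚", "卢婧姗"]),
          ("使徒行者2", ["张家辉", "古天乐", "吴镇宇"])]

# explode: one (moviename, actor) row per array element
exploded = [(name, actor) for name, actors in movies for actor in actors]

# where t.actor='古天乐'
print([row for row in exploded if row[1] == "古天乐"])   # [('使徒行者2', '古天乐')]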

Exercise 2: create a table and import data

1,张三,18:male:北京
2,李四,29:female:上海
3,杨朝来,22:male:深圳?
4,蒋平,34:male:成都
5,唐灿华,25:female:哈尔滨
6,马达,17:male:北京
7,赵小雪,23:female:杭州
8,薛文泉,26:male:上海
9,丁建,29:male:北京

Create the table and load the data.

# Create the table
0: jdbc:hive2://node01:10000> create table dept(id int,name string,info struct<age:int,gender:string,city:string>) row format delimited fields terminated by ',' collection items terminated by ':';
INFO  : Compiling command(queryId=hadoop_20191115223434_1d5c9e49-c4e7-4930-938d-4a5823486099): create table dept(id int,name string,info struct<age:int,gender:string,city:string>) row format delimited fields terminated by ',' collection items terminated by ':'
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=hadoop_20191115223434_1d5c9e49-c4e7-4930-938d-4a5823486099); Time taken: 0.008 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hadoop_20191115223434_1d5c9e49-c4e7-4930-938d-4a5823486099): create table dept(id int,name string,info struct<age:int,gender:string,city:string>) row format delimited fields terminated by ',' collection items terminated by ':'
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hadoop_20191115223434_1d5c9e49-c4e7-4930-938d-4a5823486099); Time taken: 0.086 seconds
INFO  : OK
No rows affected (0.131 seconds)
0: jdbc:hive2://node01:10000> desc dept;
INFO  : Compiling command(queryId=hadoop_20191115223434_270cf471-d325-40ed-af27-4afff5a6aabf): desc dept
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:col_name, type:string, comment:from deserializer), FieldSchema(name:data_type, type:string, comment:from deserializer), FieldSchema(name:comment, type:string, comment:from deserializer)], properties:null)
INFO  : Completed compiling command(queryId=hadoop_20191115223434_270cf471-d325-40ed-af27-4afff5a6aabf); Time taken: 0.1 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hadoop_20191115223434_270cf471-d325-40ed-af27-4afff5a6aabf): desc dept
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hadoop_20191115223434_270cf471-d325-40ed-af27-4afff5a6aabf); Time taken: 0.042 seconds
INFO  : OK
+-----------+--------------------------------------------+----------+--+
| col_name  |                 data_type                  | comment  |
+-----------+--------------------------------------------+----------+--+
| id        | int                                        |          |
| name      | string                                     |          |
| info      | struct<age:int,gender:string,city:string>  |          |
+-----------+--------------------------------------------+----------+--+
3 rows selected (0.169 seconds)
# Load the data
0: jdbc:hive2://node01:10000> load data local inpath '/kkb/install/hivedatas/dept.txt' overwrite into table dept;
INFO  : Compiling command(queryId=hadoop_20191115223535_d81bba57-47cc-473b-be36-fe32053922b8): load data local inpath '/kkb/install/hivedatas/dept.txt' overwrite into table dept
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=hadoop_20191115223535_d81bba57-47cc-473b-be36-fe32053922b8); Time taken: 0.026 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hadoop_20191115223535_d81bba57-47cc-473b-be36-fe32053922b8): load data local inpath '/kkb/install/hivedatas/dept.txt' overwrite into table dept
INFO  : Starting task [Stage-0:MOVE] in serial mode
INFO  : Loading data to table db_hive.dept from file:/kkb/install/hivedatas/dept.txt
INFO  : Starting task [Stage-1:STATS] in serial mode
INFO  : Table db_hive.dept stats: [numFiles=1, totalSize=240]
INFO  : Completed executing command(queryId=hadoop_20191115223535_d81bba57-47cc-473b-be36-fe32053922b8); Time taken: 0.303 seconds
INFO  : OK
# Query the data
0: jdbc:hive2://node01:10000> select * from dept;
INFO  : Compiling command(queryId=hadoop_20191115223737_b10807f4-8f05-4b3b-aa33-4e0bc8c408f6): select * from dept
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:dept.id, type:int, comment:null), FieldSchema(name:dept.name, type:string, comment:null), FieldSchema(name:dept.info, type:struct<age:int,gender:string,city:string>, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hadoop_20191115223737_b10807f4-8f05-4b3b-aa33-4e0bc8c408f6); Time taken: 0.054 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hadoop_20191115223737_b10807f4-8f05-4b3b-aa33-4e0bc8c408f6): select * from dept
INFO  : Completed executing command(queryId=hadoop_20191115223737_b10807f4-8f05-4b3b-aa33-4e0bc8c408f6); Time taken: 0.0 seconds
INFO  : OK
+----------+------------+--------------------------------------------+--+
| dept.id  | dept.name  |                 dept.info                  |
+----------+------------+--------------------------------------------+--+
| 1        | 张三         | {"age":18,"gender":"male","city":"北京"}     |
| 2        | 李四         | {"age":29,"gender":"female","city":"上海"}   |
| 3        | 杨朝来        | {"age":22,"gender":"male","city":"深圳?"}    |
| 4        | 蒋平         | {"age":34,"gender":"male","city":"成都"}     |
| 5        | 唐灿华        | {"age":25,"gender":"female","city":"哈尔滨"}  |
| 6        | 马达         | {"age":17,"gender":"male","city":"北京"}     |
| 7        | 赵小雪        | {"age":23,"gender":"female","city":"杭州"}   |
| 8        | 薛文泉        | {"age":26,"gender":"male","city":"上海"}     |
| 9        | 丁建         | {"age":29,"gender":"male","city":"北京"}     |
+----------+------------+--------------------------------------------+--+

4.1 Find each person's id, name, and city of residence

0: jdbc:hive2://node01:10000> select id,name,info.city from dept;
INFO  : Compiling command(queryId=hadoop_20191115224040_796610ca-d24a-48ac-a8aa-d1eaee1ddce9): select id,name,info.city from dept
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:id, type:int, comment:null), FieldSchema(name:name, type:string, comment:null), FieldSchema(name:city, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hadoop_20191115224040_796610ca-d24a-48ac-a8aa-d1eaee1ddce9); Time taken: 0.081 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hadoop_20191115224040_796610ca-d24a-48ac-a8aa-d1eaee1ddce9): select id,name,info.city from dept
INFO  : Completed executing command(queryId=hadoop_20191115224040_796610ca-d24a-48ac-a8aa-d1eaee1ddce9); Time taken: 0.001 seconds
INFO  : OK
+-----+-------+-------+--+
| id  | name  | city  |
+-----+-------+-------+--+
| 1   | 张三    | 北京    |
| 2   | 李四    | 上海    |
| 3   | 杨朝来   | 深圳?   |
| 4   | 蒋平    | 成都    |
| 5   | 唐灿华   | 哈尔滨   |
| 6   | 马达    | 北京    |
| 7   | 赵小雪   | 杭州    |
| 8   | 薛文泉   | 上海    |
| 9   | 丁建    | 北京    |
+-----+-------+-------+--+

That wraps up the Hive exercises; noted down here for the record.
