HIVE 常见函数

Posted alpha-cat

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了HIVE 常见函数相关的知识,希望对你有一定的参考价值。

函数
1.hive函数分类
    标准函数 自定义标准函数称之为 UDF   传递一行数据,返回一个结果

    聚合函数 自定义聚合函数称之为 UDAF  传递多行数据,返回一个结果 group by   sum count

    表生成函数 自定义表生成函数称之为 UDTF  传递一行数据,返回多行数据  explode


2.函数帮助文档
    SHOW FUNCTIONS;
    显示当前环境所有的函数

    DESC FUNCTION length;
    显示帮助函数文档

    DESC FUNCTION EXTENDED length;
    显示函数额外信息


3.数学函数
    round  求四舍五入

    rand  随机

    conv   进制转换函数
    conv(num, from_base, to_base)  num数据    from_base 指定的进制    to_base 转换进制

    cast  类型转换  select cast(id as STRING) from student;  把int类型转化成STRING


4.日期函数
    from_unixtime   把时间戳转化成字符串
    select from_unixtime(1567995900, yyyyMMdd HHmmss)

    unix_timestamp  把字符串转化时间戳
    select unix_timestamp(2011-12-10 13:10:12);

5.条件函数
    IF(expr1,expr2,expr3) !!!
    expr1 如果expr1为真 输出expr2   否则就输出expr3
    SELECT name, age, if(age > 20, A, B) as type FROM student;


    返回第一个非空数据 #
    SELECT coalesce(null, 12, 3,14,null);


    CASE  !!!
    给数据分组

    SELECT id, CASE   
               WHEN id <=2 THEN a 
               WHEN id > 2 AND id <=10 THEN b 
               ELSE c  
               END  FROM student;


    SELECT mycol, sum(id)FROM (SELECT id, (CASE  WHEN id <=2 THEN a WHEN id > 2 AND id <=10 THEN b ELSE c  END)  as mycol FROM student) t GROUP BY mycol
               

    SELECT CASE id  WHEN 2 THEN a   ELSE b END from student;

//////////////////////////////////////////////////////////////////////
6.字符串处理函数
    字符串拼接
    concat    拼接任何类型,中间没有分隔符指定
    select id, concat(name, age) from student;

    concat_ws 只能拼接字符串类型,可以指定分隔符
    select concat_ws(_, asdfasd, 18) ;


    instr(string str, string substr)
    select  instr(candle, dle)

    length  字符串大小
    lower  小写
    lcase  大写


    正则表达式 #
    regexp_extract  通过正则表达式查找
    regexp_extract(str, regexp[, idx])

    参数
    str 需要处理的字符串
    regexp 正则表达式 主要用于匹配
    idx   索引
          0 整个字符串匹配
          1 匹配正则表达式中第一个括号
          2 匹配正则表达式第二个括号

     select regexp_extract(x=asdfasdf12345asdf, x=([a-z]+)([0-9]+)([a-z]+), 3);



    regexp_replace  。。。。。。。替换
     select regexp_replace(x=asdfasdf12345asdf3456345, ([0-9]+),xxxx );


    URL #
    parse_url   专门用来解析URL数据
    http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere

    select parse_url(http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere,  REF);

    [HOST 域名,PATH 路径,QUERY 查询,REF 锚点,PROTOCOL 协议,FILE,AUTHORITY IP地址+端口,USERINFO]

    输出结果
        hive (hadoop)> select parse_url(http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere,  HOST);
        OK
        _c0
        www.candle.com
        Time taken: 0.07 seconds, Fetched: 1 row(s)

        hive (hadoop)> select parse_url(http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere,  PATH);
        OK
        _c0
        /dir1/dir2/dir3/file.html
        Time taken: 0.093 seconds, Fetched: 1 row(s)

        hive (hadoop)> select parse_url(http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere,  QUERY);
        OK
        _c0
        key1=value1&key2=value2
        Time taken: 0.051 seconds, Fetched: 1 row(s)

        hive (hadoop)> select parse_url(http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere,  QUERY, key1);
        OK
        _c0
        value1
        Time taken: 0.105 seconds, Fetched: 1 row(s)

        hive (hadoop)> select parse_url(http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere,  REF);
        OK
        _c0
        imhere



    JSON  #

    {"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.95,"color":"red"} }, "email":"amy@only_for_json_udf_test.net","owner":"amy"}

    
    {
        "store":
            {
                "fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],
                "bicycle":{"price":19.95,"color":"red"}
            } ,


        "email":"amy@only_for_json_udf_test.net",
        "owner":"amy"
    }


    hive (hadoop)> select get_json_object({"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.95,"color":"red"} }, "email":"amy@only_for_json_udf_test.net","owner":"amy"}, $.store.bicycle.price);
    OK
    _c0
    19.95


    hive (hadoop)> select get_json_object({"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.95,"color":"red"} }, "email":"amy@only_for_json_udf_test.net","owner":"amy"}, $.store.fruit[0].type);
    OK
    _c0
    apple




7.宽表和长表转化  !!!
    explode 表生成函数

    hive (hadoop)> select explode(city) from student1;

    错误:
    hive (hadoop)> select id, name, explode(city) from student1;
    FAILED: SemanticException [Error 10081]: UDTFs are not supported outside the SELECT clause, nor nested in expressions

    数组
        hive 中 查询,表生成函数不能直接和普通列直接查询
        需要先生成一个临时视图 拼接起来
        select id, name, mycity from student1 lateral view  explode(city) tmp_view AS mycity;

        SELECT id, name, mycity FROM student1 LATERAL VIEW explode(city) 临时视图名称 AS 自定义列名


    键值对
        键值对 explode生成两列 key value

        > select id, name , mykey, myvalue from student3 LATERAL VIEW explode(pairs) tmp_view AS mykey, myvalue;

        OK
        id    name    mykey    myvalue
        1    candle    k1    v1
        1    candle    k2    v2
        2    jack    k1    v1
        2    jack    k2    v2
        3    tom    k1    v1

    结构体:
        多列
        hive (hadoop)> select explode(addr) from student2;
        FAILED: UDFArgumentException explode() takes an array or a map as a parameter

        注意:
        结构体成员不能直接转化成多列,explode只接受arry和map
        其实,把结构体每个成员转化成每一行 没有意义

        一般再处理结构体采用如下形式
        hive (hadoop)> select id,name, addr.city, addr.area, addr.streetid from student2;
        OK
        id    name    city    area    streetid
        1    candle    shanghai    minhang    35
        2    jack    beijing    haidian    100
        3    tom    hefei    shushan    103



        一般不采用如下格式
        hive (hadoop)> select id, name, info from student2 lateral view explode(array(addr.city, addr.area, addr.streetid)) tmp_view as info;
        OK
        id    name    info
        1    candle    shanghai
        1    candle    minhang
        1    candle    35
        2    jack    beijing
        2    jack    haidian
        2    jack    100
        3    tom    hefei
        3    tom    shushan
        3    tom    103


    长转宽  聚合函数 group by
      collect_list  多行数据 合并到一行 生成数组 list 允许成员重复
      select age, collect_list(name),  collect_list(id) from student group by age;

      age    _c1    _c2
      18    ["candle","candle"]    [1,1]
      19    ["aa","jack"]    [10,2]
      20    ["c2","tom"]    [1,4]
      100    ["cc"]    [12]
      200    ["dd"]    [13]

  
      collect_set  不允许重复

      select age, collect_set(name),  collect_set(id) from student group by age;

      age    _c1    _c2
        18    ["candle"]    [1]
        19    ["aa","jack"]    [10,2]
        20    ["c2","tom"]    [1,4]
        100    ["cc"]    [12]
        200    ["dd"]    [13]


        select map(id, name) from student; 生成键值对
        select struct(id, name, age) from student; 生成结构体


8.窗口函数 !!!
    分组排序

    1) 创建表格
       CREATE TABLE windows
       (
       id INT,
       class STRING,
       score INT
       )
       ROW FORMAT DELIMITED FIELDS TERMINATED BY ,
       LINES TERMINATED BY 

       STORED AS TEXTFILE;

    2) row_number
        partition by  指定分组列
        order by   指定排序列
        SELECT class, score, row_number() over(partition by class order by score desc)  from windows;


        排名  不会因为score相同改变次序
        class    score    paiming
        class1    78    1
        class1    78    2
        class1    66    3
        class1    66    4
        class1    66    5
        class1    56    6
        class1    56    7
        class1    56    8
        class1    56    9
        class1    56    10


    3) rank

        排名  如果score重复,排序也重复,下一个名词会跳过重复的部分 

        SELECT class, score, rank() over(partition by class order by score desc)  from windows;
        class    score    _wcol0
        class1    78    1
        class1    78    1
        class1    66    3
        class1    66    3
        class1    66    3
        class1    56    6
        class1    56    6
        class1    56    6
        class1    56    6
        class1    56    6


    4) dense_rank

        如果score重复,排序也重复,下一个名词不会跳过重复的部分
        SELECT class, score, dense_rank() over(partition by class order by score desc)  from windows;

        class    score    _wcol0
        class1    78    1
        class1    78    1
        class1    66    2
        class1    66    2
        class1    66    2
        class1    56    3
        class1    56    3
        class1    56    3
        class1    56    3
        class1    56    3

    5) 分组 topn模型
    不能直接筛选

    hive (hadoop)> SELECT class, score, row_number() over(partition by class order by score desc) as paiming  from windows WHERE paiming <= 5;


    FAILED: SemanticException [Error 10004]: Line 1:110 Invalid table alias or column reference paiming: (possible column names are: id, class, score)


    使用子查询
    SELECT class, score from 
        (SELECT class, score, row_number() over(partition by class order by score desc) as paiming  from windows)  t WHERE paiming <=5;





    

 

以上是关于HIVE 常见函数的主要内容,如果未能解决你的问题,请参考以下文章

Hive常见内置函数及其使用

Hive函数入门--案例:UDF实现手机号加密--代码实现与效果演示

常见的代码片段

Alfred常见使用

hive窗口函数总结

hive关于窗口函数的使用