Select specific value from Hive array
Posted: 2016-10-08 11:19:39
I have a table in Hive with the following 3-column structure:
timestamp UserID OtherId
2016-09-01 123 "101","222","321","987","393.1","090","467","863"
2016-09-01 124 "188","389","673","972","193","100","143","210"
2016-09-01 125 "888","120","482","594","393.2"
2016-09-01 126 "441","501","322","671","008","899"
2016-09-01 127 "004","700","393.4","761","467","356","643","578"
2016-09-01 128 "322","582","348"
2016-09-01 129 "029","393.8","126","187"
OtherID is an array.
I need to parse OtherID so that the resulting dataset looks like the following, since I am only interested in values matching '393%':
timestamp UserID OtherId
2016-09-01 123 393.1
2016-09-01 125 393.2
2016-09-01 127 393.4
2016-09-01 129 393.8
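I have looked at a number of parsing functions, but they all seem either to return the position of a value or to require you to specify the position of the value in order to return it. Neither option works here, because for any given row '393%' can appear at any point in the array. I also need to incorporate a wildcard to allow for variation in the value I want.
Another option is explode, but my table is too large for that.
I think a UDF may be the only way out, but I would welcome some guidance there.
Thanks for your help.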
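Comments:
Can you try something like this? SELECT * FROM table WHERE OtherId RLIKE regexp_extract(OtherId, '(\"393\.\d\")', 1)
Thanks for the suggestion. It gives the following error: "Error while compiling statement: FAILED: SemanticException [Error 10014]: Line 2:19 Wrong arguments ''(\"393\.\d\")'': No matching method for class org.apache.hadoop.hive.ql.udf.UDFRegExpExtract with (array"101,222,321"
Or are there spaces between them? Depending on that, I would have to modify the regex slightly.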
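The error above arises because regexp_extract expects a string, while OtherId is an array. One possible way around this without exploding the table is to flatten the array with concat_ws first and run the regex on the resulting string. The following is only a sketch under stated assumptions: the table name user_ids is hypothetical, OtherId is assumed to be ARRAY<STRING> with elements that do not themselves contain quotes or commas, and only the first matching element per row is returned.

```sql
-- Hypothetical table name: user_ids. Assumes OtherId is ARRAY<STRING>.
-- concat_ws(',', OtherId) joins the array into one comma-separated string,
-- so regexp_extract receives a string argument instead of an array.
SELECT `timestamp`,
       UserID,
       -- Group 2 captures the first element that starts with 393.
       regexp_extract(concat_ws(',', OtherId), '(^|,)(393[^,]*)', 2) AS matched_id
FROM user_ids
-- Keep only rows where some element starts with 393.
WHERE concat_ws(',', OtherId) RLIKE '(^|,)393';
```

If the column is instead a plain string whose values carry literal double quotes, the concat_ws call can be dropped and the pattern adjusted accordingly, for example '"(393[^"]*)"' with group index 1. Whether this performs better than a lateral view on a very large table would need to be measured.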
Answer 1:
What you need can be done easily with the lateral view option available in Hive.
0: jdbc:hive2://quickstart:10000/default> select * from test_5;
+-----------+------------+----------------------------------------------+
| test_5.t | test_5.id | test_5.oid |
+-----------+------------+----------------------------------------------+
| 123 | 123 | "222","321","987","393.1","090","467","863" |
+-----------+------------+----------------------------------------------+
Here is the trick:
SELECT id, ooid
FROM test_5
LATERAL VIEW EXPLODE(SPLIT(oid,",")) temp AS ooid;
+------+----------+
| id | ooid |
+------+----------+
| 123 | "222" |
| 123 | "321" |
| 123 | "987" |
| 123 | "393.1" |
| 123 | "090" |
| 123 | "467" |
| 123 | "863" |
+------+----------+
Ergo:
SELECT id, regexp_replace(ooid,'"','')
FROM test_5
LATERAL VIEW EXPLODE(SPLIT(oid,",")) temp AS ooid
WHERE ooid LIKE '"393%';
+------+----------+
| id | ooid |
+------+----------+
| 123 | 393.1 |
+------+----------+
Answer 2:
Maybe you can try the following:
hive> select timestamp1, userid, otherids from userdet1 LATERAL VIEW explode(otherid) testTable as otherids where otherids LIKE concat('393','%');
OK
2016-09-01 123 393.1
2016-09-01 125 393.2
2016-09-01 127 393.4
2016-09-01 129 393.8
Time taken: 0.297 seconds, Fetched: 4 row(s)