How to extract only punctuation using the regexp_extract function in Hive

Posted 2023-03-23


[Title]: How to extract punctuations only using regexp_extract function in Hive [Posted]: 2017-08-06 14:28:14 [Question]:

I am learning the regexp_extract function in Hive. Suppose I have a table 'A' with a column 'word':

word
Hello!
world,
how
are
you?

I want to extract only the punctuation, so that the output is:

! , ?

How can I do this with regexp_extract? I tried the following but did not get the desired output:

select regexp_extract(word,"[^A-Za-z0-9]*","1") from A;

Please advise!

[Comments on the question]:

Forcing a particular solution ("how do I do this with regexp_extract") is not a good idea.

[Answer 1]:

hive> with A as (select explode(array('word','Hello!','world,','how','are','you?')) as word)
    > select  regexp_extract(word,'\\p{Punct}',0) as Punct
    > from    A
    > ;
OK
punct

!
,


?
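Since Hive delegates to Java regular expressions, the per-row result above can be reproduced outside Hive. The following is only a minimal sketch: `regexpExtract` is a hypothetical helper that approximates regexp_extract (first match, chosen group, empty string when nothing matches), not Hive's actual implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PunctExtract {
    // Hypothetical stand-in for regexp_extract: returns the requested group of the
    // first match, or an empty string when the pattern does not match at all.
    static String regexpExtract(String subject, String regex, int index) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(index) : "";
    }

    public static void main(String[] args) {
        // Same rows as the exploded array in the answer.
        String[] words = {"word", "Hello!", "world,", "how", "are", "you?"};
        for (String w : words) {
            // Brackets make the empty results visible.
            System.out.println("[" + regexpExtract(w, "\\p{Punct}", 0) + "]");
        }
    }
}
```

Rows without punctuation come back as empty strings, which corresponds to the blank lines in the Hive output.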

[Comments]:

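Hey Dudu, you came through again and my problem is solved. Thanks!

Dudu, could you recommend any simple documentation or website where I can learn which regex pattern to use for a given requirement?

Hive uses Java regular expressions: docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html P.S. (1) The third parameter of REGEXP_EXTRACT is the index of a capture group (a part of the pattern enclosed in parentheses), or 0 for the entire pattern. (2) * means zero or more, so your pattern matches a zero-length string.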
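The two points in the last comment can be checked directly against java.util.regex. This is a hedged sketch: `firstMatch` is a hypothetical helper, and the `+` variant is shown only for contrast (it would also match whitespace), not as a recommended pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ZeroLengthMatch {
    // Hypothetical helper: returns the first match of regex in subject (group 0),
    // or null when the pattern does not match anywhere.
    static String firstMatch(String regex, String subject) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(0) : null;
    }

    public static void main(String[] args) {
        // '*' allows zero repetitions, so the match succeeds at position 0
        // with an empty string, before the engine ever reaches '!'.
        System.out.println("[" + firstMatch("[^A-Za-z0-9]*", "Hello!") + "]");

        // '+' requires at least one character, so the first match is "!".
        System.out.println("[" + firstMatch("[^A-Za-z0-9]+", "Hello!") + "]");

        // The pattern from the question contains no parentheses, so it defines
        // no capture groups; 0 (the whole match) is the only valid group index.
        System.out.println(Pattern.compile("[^A-Za-z0-9]*").matcher("").groupCount());
    }
}
```

This is why the query in the question had two separate problems: the pattern matched an empty string, and group index 1 pointed at a capture group that the pattern never defined.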
