How does the Hive sentences() function break each sentence?

[Posted]: 2017-01-04 15:19:31

[Question]:

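Before posting, I tried the Hive sentences() function and did some searching, but I still could not get a clear understanding. My question: on which delimiter does the Hive sentences() function break each sentence? What does the Hive manual mean by "appropriate sentence boundary"? Below is what I tried: I placed a period (.) and an exclamation mark (!) at the same point in the text. I get different output for each. Can someone explain?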

With a period (.)

select sentences('Tokenizes a string of natural language text into words and sentences. where each sentence is broken at the appropriate sentence boundary and returned as an array of words.') from dummytable

Output: 1 array

[["Tokenizes","a","string","of","natural","language","text","into","words","and","sentences","where","each","sentence","is","broken","at","the","appropriate","sentence","boundary","and","returned","as","an","array","of","words"]]

With an exclamation mark (!)

select sentences('Tokenizes a string of natural language text into words and sentences! where each sentence is broken at the appropriate sentence boundary and returned as an array of words.') from dummytable

Output: 2 arrays

[["Tokenizes","a","string","of","natural","language","text","into","words","and","sentences"],["where","each","sentence","is","broken","at","the","appropriate","sentence","boundary","and","returned","as","an","array","of","words"]]

[Answer 1]:

If you understand what sentences() does, that should clear up your doubt.

Definition of sentences(str):

Splits str into an array of sentences, where each sentence is an array of words.

Example:

SELECT sentences('Hello there! I am a UDF.') FROM src LIMIT 1;

[ ["Hello", "there"], ["I", "am", "a", "UDF"] ]



SELECT sentences('review . language') FROM movies;

[["review","language"]]

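An exclamation mark is a punctuation mark that ends a sentence. Other such marks include the period and the question mark, which also come at the end of a sentence. However, per the definition of sentences(), unnecessary punctuation, such as periods and commas in English, is automatically stripped. That is why we get two arrays of words with "!". It all comes down to java.util.Locale.java.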
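As far as I know, Hive builds sentences() on the JDK's java.text.BreakIterator, which would mean the split is decided by locale-aware boundary rules rather than by a fixed delimiter. The following is a minimal sketch of that two-stage idea (sentence boundaries first, then word boundaries), not Hive's actual source. The class name SentencesSketch and the method split are made up for illustration, and Locale.ENGLISH is an assumed locale.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch that approximates sentences(); it is not Hive's code.
final class SentencesSketch {

    // Splits text into sentences, then splits each sentence into words.
    static List<List<String>> split(String text, Locale locale) {
        List<List<String>> result = new ArrayList<>();
        BreakIterator sentenceIt = BreakIterator.getSentenceInstance(locale);
        sentenceIt.setText(text);
        int start = sentenceIt.first();
        for (int end = sentenceIt.next(); end != BreakIterator.DONE;
                start = end, end = sentenceIt.next()) {
            result.add(words(text.substring(start, end), locale));
        }
        return result;
    }

    // Keeps only tokens that begin with a letter or digit, so punctuation
    // and whitespace tokens are dropped from the word list.
    private static List<String> words(String sentence, Locale locale) {
        List<String> out = new ArrayList<>();
        BreakIterator wordIt = BreakIterator.getWordInstance(locale);
        wordIt.setText(sentence);
        int start = wordIt.first();
        for (int end = wordIt.next(); end != BreakIterator.DONE;
                start = end, end = wordIt.next()) {
            String token = sentence.substring(start, end);
            if (Character.isLetterOrDigit(token.charAt(0))) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The manual's example string from the answer above.
        System.out.println(split("Hello there! I am a UDF.", Locale.ENGLISH));
    }
}
```

Under these assumptions, the period in 'review . language' is dropped as a non-word token, but it does not by itself start a new sentence. That is a separate decision made by the sentence iterator.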

[Discussion]:

This does not help at all. The question is why it did not split at the period.

[Answer 2]:

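I don't know the actual reason, but I observed that it works if the period (.) is followed by a space and the next word starts with a capital letter. Here I changed "where" to "Where". However, this is not required for the exclamation mark (!).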

Tokenizes a string of natural language text into words and sentences. Where each sentence is broken at the appropriate sentence boundary and returned as an array of words.

This gives the output below:

[["Tokenizes","a","string","of","natural","language","text","into","words","and","sentences"],["Where","each","sentence","is","broken","at","the","appropriate","sentence","boundary","and","returned","as","an","array","of","words"]]
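The observation about capitalization is consistent with how the JDK's sentence BreakIterator treats a period: it is an ambiguous terminator that only ends a sentence when what follows looks like a sentence start, whereas "!" and "?" end the sentence regardless. The sketch below counts sentences for the three variants discussed in this thread. It assumes that sentences() delegates to this JDK class (an assumption about Hive's internals on my part), and the class name BoundaryDemo and method countSentences are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical demo class; only java.text.BreakIterator is a real API here.
final class BoundaryDemo {

    // Counts the sentences that the JDK sentence iterator finds in text.
    static int countSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int count = 0;
        it.first();
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String[] samples = {
            "words and sentences. where each sentence is broken.",  // period + lowercase
            "words and sentences! where each sentence is broken.",  // exclamation mark
            "words and sentences. Where each sentence is broken."   // period + uppercase
        };
        for (String s : samples) {
            System.out.println(countSentences(s) + " <- " + s);
        }
    }
}
```

If the assumption holds, the practical takeaway is that text with a lowercase word after a period needs to be normalized before calling sentences() if each period should start a new sentence.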
