What delimiters does Hive's ngram use to tokenize?
I am doing some sentiment analysis, and I need to count the vocabulary (the distinct words) in a text.
The ngram UDF seems to do a good job of determining unigrams. I would like to know what delimiters it uses to determine the unigrams/tokens; this matters if I want to simulate the vocabulary count with the split UDF. For example, given the following text (a product review):
I was aboslutely shocked to see how much 1 oz really was. At $7.60, I mistakenly assumed it would be a decent sized can. As locally I am able to buy a medium sized tube of wasabi paste for around $3, but never used it fast enough so it would get old. I figured a powder would be better, so I can mix it as I needed it. When I opened the box and dug thru the packing and saw this little little can, I started looking for the hidden cameras ... thought this HAD to be a joke. Nope .. and it's NOT returnable either. SO I HAVE LEARNED MY LESSON. Please just be aware if you should decide you want this EXPENSIVE wasabi powder.
The ngram UDF counts 82 unigrams/tokens:
SELECT count(*) FROM
(SELECT explode(ngrams(sentences(upper(reviewtext)),1,9999999))
FROM amazon.Food_review_part_small WHERE asin = 'B0000CNU1X' AND reviewerid ='A1UCAVBNJUZMPR') t;
82
However, using the split UDF with whitespace, comma, period, hyphen, and double quote as delimiters, there are 85 unigrams/tokens:
select count(distinct(te)) FROM amazon.Food_review_part_small
lateral view explode(split(upper(reviewtext), '[\\s,.-]|"')) t as te
WHERE te <> '' AND asin = 'B0000CNU1X' AND reviewerid ='A1UCAVBNJUZMPR';
85
Of course, I cannot find any documentation on this. Does anyone know what delimiters the ngram UDF uses to determine the unigram tokens?
The UDAF ngrams does not split the data; it actually expects an array of strings, or an array of arrays of strings, as input. In this case it is the sentences UDF that splits the data. From the Java comments:
+ "Unnecessary punctuation, such as periods and commas in English, is automatically stripped."
+ " If specified, 'lang' should be a two-letter ISO-639 language code (such as 'en'), and "
+ "'country' should be a two-letter ISO-3166 code (such as 'us'). Not all country and "
+ "language codes are fully supported, and if an unsupported code is specified, a default "
If you run the following query:
select sentences("I was aboslutely shocked to see how much 1 oz really was. At $7.60, I mistakenly assumed it would be a decent sized can. As locally I am able to buy a medium sized tube of wasabi paste for around $3, but never used it fast enough so it would get old. I figured a powder would be better, so I can mix it as I needed it. When I opened the box and dug thru the packing and saw this little little can, I started looking for the hidden cameras ... thought this HAD to be a joke. Nope .. and it's NOT returnable either. SO I HAVE LEARNED MY LESSON. Please just be aware if you should decide you want this EXPENSIVE wasabi powder.");
you will get the following result:
[["I","was","aboslutely","shocked","to","see","how","much","1","oz","really","was"],["At","I","mistakenly","assumed","it","would","be","a","decent","sized","can"],["As","locally","I","am","able","to","buy","a","medium","sized","tube","of","wasabi","paste","for","around","but","never","used","it","fast","enough","so","it","would","get","old"],["I","figured","a","powder","would","be","better","so","I","can","mix","it","as","I","needed","it"],["When","I","opened","the","box","and","dug","thru","the","packing","and","saw","this","little","little","can","I","started","looking","for","the","hidden","cameras","thought","this","HAD","to","be","a","joke"],["Nope","and","it's","NOT","returnable","either"],["SO","I","HAVE","LEARNED","MY","LESSON"],["Please","just","be","aware","if","you","should","decide","you","want","this","EXPENSIVE","wasabi","powder"]]
As you can see, the sentences UDF strips out some "noise" such as "$7.60" and "$3", as well as the empty strings.
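If the goal is a vocabulary count that matches what ngrams sees, one option is to tokenize with sentences itself rather than split. Here is a hedged sketch (my own, not from the original answer) against the same table and columns as in the question; the two LATERAL VIEW explodes flatten the array-of-arrays that sentences returns:
SELECT count(DISTINCT te)
FROM amazon.Food_review_part_small
LATERAL VIEW explode(sentences(upper(reviewtext))) s AS sentence
LATERAL VIEW explode(sentence) t AS te
WHERE asin = 'B0000CNU1X' AND reviewerid = 'A1UCAVBNJUZMPR';
Since this shares its tokenizer with the ngrams query above, the count should line up with the 82 that query reported.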