Jieba Chinese Word Segmentation (结巴分词)
Posted by demonxian3
#!coding: utf-8
# Python 2 demo of the jieba segmentation API.
import jieba
import jieba.posseg as pseg
import jieba.analyse as anal
from optparse import OptionParser

usage = "usage: python %prog [--tag] [--fast] [--tfidf topK] [--textr topK]"
parser = OptionParser(usage)
parser.add_option("--tag", dest="tag", action="store_true")
parser.add_option("--fast", dest="fast", action="store_true")
parser.add_option("--tfidf", dest="tfidf")
parser.add_option("--textr", dest="textr")
opt, args = parser.parse_args()

txt = "支持三种分词模式: 精确模式,试图将句子最精确地切开,适合文本分析; 全模式,把句子中所有的可以成词的词语都扫描出来, 速度非常快,但是不能解决歧义; 搜索引擎模式,在精确模式的基础上,对长词再次切分,提高召回率,适合用于搜索引擎分词。 支持繁体分词 支持自定义词典 MIT 授权协议 在线演示"

# parallel segmentation across 10 worker processes
if opt.fast:
    jieba.enable_parallel(10)

# tune the dictionary: register a custom word, and suggest that "协议" be split
jieba.add_word("全模式")
jieba.suggest_freq(("协", "议"), True)

# accurate mode: cut() returns a generator, lcut() a list
print "/".join(jieba.cut(txt))
print "/".join(jieba.lcut(txt))

# search-engine mode: long words are cut again to raise recall
print "/".join(jieba.cut_for_search(txt))

# word positions: tokenize() requires unicode input
res = jieba.tokenize(txt.decode("utf-8"))
# res = jieba.tokenize(txt.decode("utf-8"), mode="search")  # search mode
print "word start end"
for tk in res:
    print("%s %d %d" % (tk[0], tk[1], tk[2]))

# part-of-speech tagging
if opt.tag:
    for w, k in pseg.cut(txt):
        print w + "(" + k + ")",

# top-K keywords ranked by TF-IDF
if opt.tfidf:
    topK = int(opt.tfidf)
    tags = anal.extract_tags(txt, topK, withWeight=True)
    for word, weight in tags:
        print word, weight

# top-K keywords ranked by TextRank
if opt.textr:
    topK = int(opt.textr)
    tags = anal.textrank(txt, topK, withWeight=True)
    for word, weight in tags:
        print word, weight
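To try the script, save it under any name (jieba_demo.py below is a hypothetical filename) and run it under Python 2 with the flags defined above, for example: python jieba_demo.py --fast --tag --tfidf 5 --textr 5. Without flags it only prints the three segmentation outputs and the token-position table.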
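The script above is Python 2 only (bare print statements and an explicit .decode("utf-8")). For Python 3, where strings are already unicode, the same jieba calls look like the minimal sketch below; it also exercises full mode (cut_all=True), which the sample text describes but the script never actually calls. Treat it as a sketch against the current jieba API rather than a drop-in port of the whole option-parsing script.

# Python 3 sketch of the same pipeline (assumes jieba is installed)
import jieba
import jieba.posseg as pseg
import jieba.analyse as anal

txt = "在精确模式的基础上,对长词再次切分,提高召回率"  # any Chinese sample works

# accurate mode (default), full mode, and search-engine mode
print("/".join(jieba.cut(txt)))                # accurate mode
print("/".join(jieba.cut(txt, cut_all=True)))  # full mode: every possible word
print("/".join(jieba.cut_for_search(txt)))     # search-engine mode

# token positions: Python 3 strings are already unicode, so no .decode() is needed
for word, start, end in jieba.tokenize(txt):
    print("%s %d %d" % (word, start, end))

# part-of-speech tagging: pseg.cut() yields pairs with .word and .flag
for w in pseg.cut(txt):
    print("%s(%s)" % (w.word, w.flag), end=" ")
print()

# keyword extraction by TF-IDF and by TextRank
for word, weight in anal.extract_tags(txt, topK=5, withWeight=True):
    print(word, weight)
for word, weight in anal.textrank(txt, topK=5, withWeight=True):
    print(word, weight)

Note that textrank() filters by part of speech by default, so on a very short sample it may return fewer than topK keywords.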