jieba part-of-speech (POS) tag table
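Reference answer A: For plain segmentation, use the jieba.cut function; to segment and tag parts of speech in one pass, use jieba.posseg.cut. The answer cited an example from the official site and jieba's POS tag table; neither is reproduced in this copy.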
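A minimal sketch of POS tagging with jieba.posseg.cut, for illustration only. The `COMMON_TAGS` dictionary is a small, partial subset of jieba's ICTCLAS-compatible tags and is not the full table; the helper names `tag_sentence` and `describe` are hypothetical and not part of jieba. jieba is imported inside `tag_sentence`, so that function needs jieba installed.

```python
# Partial, illustrative subset of the ICTCLAS-style tags that jieba emits.
# This is NOT the complete tag table.
COMMON_TAGS = {
    "n": "noun",
    "nr": "person name",
    "ns": "place name",
    "nt": "organization name",
    "v": "verb",
    "a": "adjective",
    "d": "adverb",
    "r": "pronoun",
    "m": "numeral",
    "p": "preposition",
    "c": "conjunction",
    "x": "punctuation or other non-word symbol",
}


def tag_sentence(sentence):
    """Segment and tag a sentence; returns a list of (word, flag) tuples."""
    import jieba.posseg as pseg  # requires jieba to be installed

    return [(pair.word, pair.flag) for pair in pseg.cut(sentence)]


def describe(pairs):
    """Attach a readable description to each (word, flag) pair."""
    return [
        (word, flag, COMMON_TAGS.get(flag, "not in this subset"))
        for word, flag in pairs
    ]
```

With jieba installed, `describe(tag_sentence("我爱北京天安门"))` returns one (word, flag, description) triple per token. The exact segmentation and tags depend on the jieba version and dictionary in use.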
How to remove stopwords when segmenting with jieba in Python
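# -*- coding: utf-8 -*-
# Python 2 only: reload(sys) is not a builtin and sys.setdefaultencoding() does not exist in Python 3.
import jieba
import jieba.analyse
import sys
import codecs
reload(sys)
sys.setdefaultencoding('utf-8')
# Alternative: read the stopword list with an explicit encoding
#stoplist = codecs.open('../../file/stopword.txt','r',encoding='utf8').readlines()
#stoplist = set(w.strip() for w in stoplist)
# The stopword file is UTF-8 encoded
stoplist = {}.fromkeys([ line.strip() for line in open("../../file/stopword.txt") ])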
# Segmentation should return unicode tokens; convert them to UTF-8 first

Reference answer A:
import jieba

# Build the stopword set (UTF-8 file, one word per line)
def stopwordslist(filepath):
    with open(filepath, 'r', encoding='utf-8') as f:
        return set(line.strip() for line in f)

# Load the stopwords once instead of on every call
stopwords = stopwordslist('./test/stopwords.txt')  # path to the stopword file

# Segment a sentence and drop stopwords
def seg_sentence(sentence):
    sentence_seged = jieba.cut(sentence.strip())
    outstr = ''
    for word in sentence_seged:
        if word not in stopwords:
            if word != '\t':
                outstr += word
                outstr += " "
    return outstr

with open('./test/input.txt', 'r', encoding='utf-8') as inputs, \
        open('./test/output.txt', 'w', encoding='utf-8') as outputs:
    for line in inputs:
        line_seg = seg_sentence(line)  # returns a space-separated string
        outputs.write(line_seg + '\n')
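The filtering step can also be isolated from file handling so it is easier to test. The sketch below is one possible arrangement, not the answerer's code: the names `load_stopwords` and `remove_stopwords` and the `tokenize` parameter are assumptions made for illustration. When `tokenize` is omitted, it falls back to jieba.lcut, which returns the segmentation as a list.

```python
def load_stopwords(lines):
    """Build a stopword set from an iterable of lines, skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}


def remove_stopwords(sentence, stopwords, tokenize=None):
    """Segment `sentence` and return the tokens that are not stopwords.

    `tokenize` is a hypothetical hook: any callable mapping a string to a
    list of tokens. It defaults to jieba.lcut.
    """
    if tokenize is None:
        import jieba  # imported lazily; only needed for the default tokenizer

        tokenize = jieba.lcut
    return [
        token
        for token in tokenize(sentence.strip())
        if token.strip() and token not in stopwords  # drop whitespace-only tokens
    ]
```

A typical call would be `remove_stopwords(line, load_stopwords(open(path, encoding='utf-8')))`, where `path` points to a stopword file with one word per line. Keeping the stopwords in a set makes each membership check constant time on average.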