Chinese Word Segmentation and Word Frequency Counting in Python

Posted 2023-03-28 爱吃糖的月妖妖


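Contents

Preface

I. Importing the text

II. Steps

1. Import the libraries

2. Load the data

3. Build the stop-word list

4. Segment and remove stop words (at this point word frequencies can be counted directly with Python's built-in functions)

5. Write the useful words (segmented, stop words removed) to a txt file

6. Call the functions

7. Result

Summary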

Preface

This article records some of the steps, with code, that I use when processing text in Python.

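I. Importing the text

I prepared a text file named abstract.txt.

Next, I downloaded stopword.txt from the web (the stop-word list used with jieba segmentation) and added some words of my own that I considered uninformative. Note that the code below opens the file as chinesestopwords.txt, so the filename in the code must match the file you actually saved.

I also built a custom dictionary, extraDict.txt.

With the preparation done, let's see how to use them!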

II. Steps

1. Import the libraries

The code is as follows:

import jieba
from jieba.analyse import extract_tags
from sklearn.feature_extraction.text import TfidfVectorizer

2. Load the data

The code is as follows:

jieba.load_userdict('extraDict.txt')  # load the custom dictionary
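For readers who have not built a user dictionary before, here is a minimal sketch of what a file like extraDict.txt can contain. jieba reads one entry per line: the word, then an optional frequency, then an optional part-of-speech tag, separated by spaces. The entries, the helper `format_entry`, and the file name `extraDict_demo.txt` are made up for illustration; they are not the author's dictionary.

```python
import tempfile
from pathlib import Path

# Hypothetical entries: (word, optional frequency, optional part-of-speech tag)
entries = [
    ("词频统计", 10, "n"),
    ("停用词表", 5, None),
    ("结巴分词", None, None),
]

def format_entry(word, freq=None, tag=None):
    """Build one dictionary line: word, then frequency and tag if given."""
    parts = [word]
    if freq is not None:
        parts.append(str(freq))
    if tag is not None:
        parts.append(tag)
    return " ".join(parts)

lines = [format_entry(*entry) for entry in entries]

with tempfile.TemporaryDirectory() as tmp:
    dict_path = Path(tmp) / "extraDict_demo.txt"
    # The dictionary file should be UTF-8 encoded
    dict_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    # With jieba installed, the file could then be loaded with:
    #     jieba.load_userdict(str(dict_path))
    print(dict_path.read_text(encoding="utf-8"))
```

Adding domain terms this way helps jieba keep them as single tokens instead of splitting them.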

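3. Build the stop-word list

def stopwordlist():
    # Read one stop word per line; the with-block closes the file afterwards
    with open('chinesestopwords.txt', encoding='UTF-8') as f:
        stopwords = [line.strip() for line in f]
    # --- extra stop words, adjust to your own data ---
    # here: the numbers 10 through 28, as strings
    for i in range(19):
        stopwords.append(str(10 + i))
    # ----------------------

    return stopwords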
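Because the stop-word list is checked once for every token, a set can be a better container than a list: membership tests on a set take roughly constant time. The sketch below is one possible variant, not the author's function; `load_stopwords`, its `extra` parameter, and the demo file are assumptions for illustration.

```python
import tempfile
from pathlib import Path

def load_stopwords(path, extra=()):
    """Read one stop word per line and return them as a set (blank lines skipped)."""
    with open(path, encoding="utf-8") as f:
        words = {line.strip() for line in f if line.strip()}
    words.update(extra)  # add any project-specific stop words
    return words

# Demo with a small temporary file standing in for the real stop-word file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "stopwords_demo.txt"
    demo_path.write_text("的\n了\n\n是\n", encoding="utf-8")
    # Same supplement as in the article: the numbers 10 through 28 as strings
    stopwords = load_stopwords(demo_path, extra=(str(n) for n in range(10, 29)))

print("的" in stopwords, "28" in stopwords, "分词" in stopwords)
```

Loading the set once and passing it to the segmentation function also avoids re-reading the file for every line of input.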

4. Segment and remove stop words (at this point word frequencies can be counted directly with Python's built-in functions)

def seg_word(line):
    # seg = jieba.cut_for_search(line.strip())
    seg = jieba.cut(line.strip())
    temp = ""
    counts = {}
    wordstop = stopwordlist()
    for word in seg:
        if word not in wordstop:
            if word != ' ':
                temp += word
                temp += '\n'
                counts[word] = counts.get(word, 0) + 1  # count how often each word occurs
    return temp  # the segmented words, one per line
    # return str(sorted(counts.items(), key=lambda x: x[1], reverse=True)[:20])  # the 20 most frequent words and their counts
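The counting step mentioned in the heading can also be done with `collections.Counter` from the standard library. The following is a minimal sketch under the assumption that the text has already been segmented; the token list stands in for the output of `jieba.cut`, and `top_words` is a hypothetical helper name.

```python
from collections import Counter

def top_words(tokens, stopwords, n=20):
    """Count tokens that are neither blank nor stop words; return the n most common."""
    counts = Counter(
        tok for tok in tokens
        if tok.strip() and tok not in stopwords
    )
    return counts.most_common(n)

# Stand-in for the tokens jieba.cut would yield for one line of text
tokens = ["中文", "分词", "的", "结果", " ", "分词", "和", "词频", "分词", "词频"]
stopwords = {"的", "和"}

print(top_words(tokens, stopwords, n=2))
# → [('分词', 3), ('词频', 2)]
```

`most_common(n)` replaces the manual `sorted(..., reverse=True)[:20]` shown in the commented-out line above.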

5. Write the useful words (segmented, stop words removed) to a txt file

def output(inputfilename, outputfilename):
    # Segment the input file line by line and write the kept words to the output file
    with open(inputfilename, encoding='UTF-8', mode='r') as inputfile, \
            open(outputfilename, encoding='UTF-8', mode='w') as outputfile:
        for line in inputfile:
            line_seg = seg_word(line)
            outputfile.write(line_seg)
    return outputfile  # note: the file object is already closed at this point
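The function above writes the segmented words but discards the per-line counts. If a frequency table for the whole file is wanted, one option is to accumulate a single `Counter` across all lines. This is a sketch, not the author's code: `count_file`, `write_counts`, and the `segment` parameter are hypothetical, and `str.split` is used as a stand-in segmenter so the example runs without jieba (in practice `jieba.lcut` could be passed instead).

```python
import tempfile
from collections import Counter
from pathlib import Path

def count_file(inputfilename, stopwords, segment=str.split):
    """Accumulate word counts over every line of a UTF-8 text file."""
    totals = Counter()
    with open(inputfilename, encoding="utf-8") as infile:
        for line in infile:
            for word in segment(line.strip()):
                if word.strip() and word not in stopwords:
                    totals[word] += 1
    return totals

def write_counts(totals, outputfilename):
    """Write 'word<TAB>count' lines, most frequent first."""
    with open(outputfilename, "w", encoding="utf-8") as outfile:
        for word, freq in totals.most_common():
            outfile.write(f"{word}\t{freq}\n")

# Demo on a temporary, already space-separated file
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "abstract_demo.txt"
    dst = Path(tmp) / "counts_demo.txt"
    src.write_text("分词 的 结果\n词频 分词\n", encoding="utf-8")
    totals = count_file(src, stopwords={"的"})
    write_counts(totals, dst)
    written = dst.read_text(encoding="utf-8")

print(written)
```

Keeping the segmenter as a parameter makes it easy to swap between `jieba.lcut` and search-mode segmentation without touching the counting logic.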

6. Call the functions

if __name__ == '__main__':
    print("__name__", __name__)
    inputfilename = 'abstract.txt'
    outputfilename = 'a1.txt'
    output(inputfilename, outputfilename)

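7. Result

Summary

That is all for today. This article only gave a brief introduction to Chinese word segmentation and word frequency counting in Python; corrections are welcome!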

Segmenting text and counting word frequencies with Python

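#!/usr/bin/python
# -*- coding: UTF-8 -*-
# Segment text and count word frequencies
import jieba
import re
from collections import Counter

filename = r"../data/commentText.txt"
result = "result_com.txt"
# Digits, whitespace, and ASCII/Chinese punctuation to strip before segmenting.
# The hyphen is escaped so that ":-【" is not read as a character range.
pattern = r"[0-9\s+\.\!\/_,$%^*()?;；:\-【】+\"']+|[+——！，;：。？、 ~@#￥%……&*（）]+"
with open(filename, 'r', encoding='utf-8') as fr:
    content = re.sub(pattern, " ", fr.read())
    # re.sub(pattern, repl, string, count=0, flags=0)
    # pattern: the regular expression to match
    # repl: the replacement (either a string or a function)
    # string: the string to be processed
    # count: maximum number of replacements; 0 (the default) replaces all matches
    # flags: optional regex flags such as re.IGNORECASE
data = jieba.cut(content, cut_all=False)

# Counter is a dict subclass made for counting; dict() turns it into a plain dictionary
data = dict(Counter(data))
with open(result, 'w', encoding="utf-8") as fw:
    for k, v in data.items():
        if len(k) > 1:  # skip single-character tokens
            fw.write(k)
            fw.write("\t%d\n" % v)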
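To see what the cleaning step does before segmentation, here is a small sketch with a simplified character class. The pattern `PUNCT`, the helper `clean`, and the sample sentence are assumptions for illustration; the pattern is deliberately shorter than the one in the script above.

```python
import re

# Simplified pattern: digits, whitespace, and common ASCII/Chinese punctuation
PUNCT = re.compile(r"[0-9\s,.!?;:，。！？；：、（）()\"']+")

def clean(text):
    """Replace each run of digits/punctuation/whitespace with a single space."""
    return PUNCT.sub(" ", text).strip()

print(clean("房间很干净，服务也不错！2023年入住。"))
# → 房间很干净 服务也不错 年入住
```

Because each run of matched characters collapses into one space, the segmenter later sees clean phrases separated by single spaces.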

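Language: Python 3.7. Packages: jieba, collections.Counter, re.

Problem encountered: the output file was initially opened without an explicit encoding, so the text was written out as hexadecimal escapes; specifying the encoding (encoding="utf-8") fixes this.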
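The encoding issue described above can be illustrated with a short sketch. When `open()` is called without `encoding`, Python falls back to a platform-dependent default, which may not handle Chinese text as expected. The rows and the file name `result_demo.txt` are made up for illustration.

```python
import locale
import tempfile
from pathlib import Path

# The default text encoding depends on the platform and locale settings
print("default encoding:", locale.getpreferredencoding(False))

rows = [("分词", 3), ("词频", 2)]  # hypothetical (word, count) pairs

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "result_demo.txt"
    # Stating the encoding explicitly makes the output the same on every platform
    with open(path, "w", encoding="utf-8") as fw:
        for word, freq in rows:
            fw.write(f"{word}\t{freq}\n")
    raw = path.read_bytes()                   # the bytes actually stored on disk
    text = path.read_text(encoding="utf-8")   # decoded with the same encoding

print(text)
```

Reading the file back with the same encoding that was used for writing is what keeps the round trip lossless.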
