Making Beautiful Word Clouds in Python Without Installing C++

Posted by 但老师





Background

To build a word cloud, the usual route is jieba for segmentation and the wordcloud package for rendering. The reason this article exists is that the wordcloud module depends on C++, which drags you into installing Microsoft's build environment, and that is a long story of pain.
The typical error looks like this:

Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> wordcloud

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

But in reality you can build a word cloud without wordcloud at all, and it is remarkably simple.
We use pyecharts.
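
Both jieba and pyecharts ship as pure-Python packages, so installing them should never touch a compiler:

pip install jieba pyecharts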


Environment

  • Python 3.8
    • jieba
    • pyecharts
  • Windows 10

Steps

My own workflow for a word cloud runs roughly like this:

extract data => clean => custom dictionary & stop words => segment => inspect => word cloud

In practice you go from inspect back to custom dictionary & stop words and repeat the whole chain many times, because the word lists need constant refinement (a minimal sketch of that loop follows below).

*A stop word is simply a token that should be removed.
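
A minimal sketch of that inspect-and-refine loop; the text and the stop words here are placeholders:

import jieba
from collections import Counter

stopword = [',', '。', '的', '了']        # grows a little after every inspection pass
text = '这里是一段待分析的示例文本'          # placeholder text

words = [w for w in jieba.cut(text) if w not in stopword]
print(Counter(words).most_common(20))      # inspect the top words, extend stopword, rerun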


Code

Core word-cloud code

This step can follow the pyecharts official docs:

Wordcloud - Wordcloud_custom_mask_image

Provide just these two things:

  • the local path of the background image
  • the local path where the rendered HTML file should go

and that is enough; there is no need to extract a base64 encoding the way the tutorial does.

from pyecharts.charts import WordCloud

htmlpath = r'D:\OneDrive\桌面\test.html'
bgpath   = r'D:\OneDrive\桌面\sunlands3.png'
c = (
    WordCloud()
    .add(
        series_name = '词云',
        data_pair   = frequence,
        mask_image  = bgpath
    )
    .render(htmlpath)
)

Note: frequence is my word-frequency variable, shaped like [('学校', 463), ('推荐', 459), ('可以', 429)...]
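
For illustration, a list in exactly that shape can be built with collections.Counter (the tokens below are placeholders):

from collections import Counter

tokens = ['学校', '推荐', '学校', '可以', '推荐', '学校']   # placeholder tokens
frequence = Counter(tokens).most_common()                  # [('学校', 3), ('推荐', 2), ('可以', 1)]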


Full code

The code follows the steps laid out above.

Below is the test code, suited to people who like building up code step by step; it also runs fine in a notebook.

from Udfs.cls_sql_new import mysql    # my own SDK; anything that can pull rows from a database will do
from pyecharts.charts import WordCloud
import jieba
import re

# pull the data from the database
sql = "select discuss from sd_business.tmp_discuss_no_reserve_kaoyan"
res = mysql().exec(sql)

# strip @-mentions with a regex
pat = r'@.*\s'
wordlist = [re.sub(pat, '', x[0]) for x in res]

# custom dictionary
filepath = r'D:\OneDrive\doc\for_share\Python_eco\python\jieba.txt'
jieba.load_userdict(filepath)

# segmentation
cutword = [list(jieba.cut(x)) for x in wordlist]

# stop words
stopword = ['和','哈','哦',',','[',']','?','也','的','么','.','了',' ','啊','呀','呢',':',':','…','~','噢','。','+','!','额','@','-','?','*','丿','—','\n','“','吗','是','有','我','小']

# word-frequency count & second cleaning pass
dic = {}
for c in cutword:
    for w in c:
        if w not in stopword and not w.isdigit() and len(w) > 1:
            if w in dic:
                dic[w] += 1
            else:
                dic[w] = 1

# dict -> list[tuple]
wordfreq = [(k, v) for k, v in dic.items()]

# sort by frequency, descending
wordfreSort = sorted(wordfreq, key=lambda x: x[1], reverse=True)

# too many words; keep just the top 300
frequence = wordfreSort[:300]

# render the word cloud
htmlpath = r'D:\OneDrive\桌面\test.html'
bgpath = r'D:\OneDrive\桌面\sunlands3.png'
c = (
    WordCloud()
    .add(
        series_name='词云',
        data_pair=frequence,
        mask_image=bgpath
    )
    .render(htmlpath)
)
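
Since notebooks were mentioned above: in pyecharts v1+ a chart can also be drawn inline instead of being written to a file. A sketch, assuming a Jupyter environment with frequence and bgpath defined as above:

# inline rendering in Jupyter, instead of .render(htmlpath)
chart = WordCloud().add(series_name='词云', data_pair=frequence, mask_image=bgpath)
chart.render_notebook()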

Full code, packaged as a class

'''
jieba segmentation SDK
'''

from pyecharts.charts import WordCloud
import jieba

class JiebaWordCloud:
    def __init__(self, article: str, userdict: str, stopword: list):
        '''article: the full text as a single string'''
        jieba.load_userdict(userdict)
        self.article = article
        self.stopword = stopword

    def wordCount(self):
        '''segment the text and count word frequencies'''
        dic = {}
        cutword = list(jieba.cut(self.article))
        for w in cutword:
            if w not in self.stopword and not w.isdigit() and len(w) > 1:
                if w in dic:
                    dic[w] += 1
                else:
                    dic[w] = 1
        wordfreq = [(k, v) for k, v in dic.items()]
        wordfreSort = sorted(wordfreq, key=lambda x: x[1], reverse=True)
        self.finalWordList = wordfreSort

    def userWordCloud(self, bgpath, htmlpath):
        '''render the word cloud to an HTML file'''
        c = (
            WordCloud()
            .add(
                series_name = '词云',
                data_pair   = self.finalWordList,
                mask_image  = bgpath
            )
            .render(htmlpath)
        )
        return htmlpath

    @property
    def returnWordList(self):
        return self.finalWordList

if __name__ == '__main__':
    from Udfs.cls_sql_new import mysql
    htmlpath = r'D:\OneDrive\桌面\test.html'
    bgpath   = r'D:\OneDrive\桌面\sunlands3.png'
    stopword = ['和','哈','哦',',','[',']','?','也','的','么','.','了',' ','啊','呀','呢',':',':','…','~','噢','。','+','!','额','@','-','?','*','丿','—','\n','“','吗','是','有','我','小']
    userdict = r'D:\OneDrive\doc\for_share\Python_eco\python\jieba.txt'
    sql = "select discuss from sd_business.tmp_discuss_no_reserve_kaoyan"
    res = ''.join([x[0] for x in mysql().exec(sql)])
    jb  = JiebaWordCloud(res, userdict, stopword)
    jb.wordCount()
    htm = jb.userWordCloud(bgpath, htmlpath)
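
The mysql helper is my own private SDK; any text source will do. A minimal sketch that feeds the class a plain string instead (the text, stop words, and dictionary path are placeholders):

from Udfs.Jieba import JiebaWordCloud

article  = '一段用来试跑的中文文本,内容越长,词云效果越好。'   # placeholder text
stopword = [',', '。', '的', '了']
userdict = r'D:\OneDrive\doc\for_share\Python_eco\python\jieba.txt'

jb = JiebaWordCloud(article, userdict, stopword)
jb.wordCount()
print(jb.returnWordList[:10])   # top (word, count) pairs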

Example

Let's take the top article on the WeChat hot list as the example.
I have always liked People (人物), and I like this piece; I also published a post on my official account for this event, praying for a miracle 😔. To avoid any suspicion of riding the hot topic, I will leave the related keywords out.

The article URL appears in the code below.

  1. Fetch the article body
    This step takes line-by-line inspection: besides extracting with regexes, you have to look at the returned result after each step to converge on the final code below.
import requests
import re

def fetchBody():
    url = 'https://mp.weixin.qq.com/s?__biz=MjEwMzA5NTcyMQ==&mid=2653153430&idx=1&sn=820509fd028bc41c3334bee28e23594f&scene=0'
    res = requests.get(url)
    pat = '<span.*>.*?</span>'             # grab the <span> chunks that carry the body text
    fnd = re.findall(pat, res.text)
    pat = '<.*?>'                          # strip any remaining tags
    x   = [re.sub(pat, '', f) for f in fnd]
    body = x[2].replace('&nbsp;', '')      # on this page, the third match held the body
    return body
  2. Generate the word cloud
from Udfs.Jieba import JiebaWordCloud

htmlpath = r'D:\OneDrive\桌面\test.html'
bgpath   = r'D:\OneDrive\桌面\plane.png'
stopword = ['和','哈','哦',',','[',']','?','也','的','么','.','了',' ','啊','呀','呢',':','一个',':','…','~','噢','。','+','!','额','@','-','?','*','丿','—','\n','“','吗','是','有','我','小']
userdict = r'D:\OneDrive\doc\for_share\Python_eco\python\jieba.txt'
article  = fetchBody()
jb  = JiebaWordCloud(article, userdict, stopword)
jb.wordCount()
htm = jb.userWordCloud(bgpath, htmlpath)

After rendering there is usually a quirk: opening the HTML file shows a blank page; refresh it once and the chart appears (likely because the echarts assets load from a CDN and are not ready on the first paint).

  3. Result
    The rendered cloud looks fairly slick, but the keywords are not actually that prominent, probably because too many noise words got through. To see the real frequencies you still have to inspect them in code.

- Hoping for a miracle -
