Making Beautiful Word Clouds in Python Without Installing C++
Posted by 但老师
Background
When you need a word cloud, the usual approach in Python is to use jieba for word segmentation and then WordCloud to generate the cloud. The reason this post exists is that the wordcloud module depends on C++: installing it drags in Microsoft's build environment, which is a long, painful story.
The error usually looks like this:
Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure
× Encountered error while trying to install package.
╰─> wordcloud
note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.
In fact, though, you can build a word cloud without the wordcloud package, and it is very simple: we'll use pyecharts instead.
Environment
- Python 3.8
- jieba
- pyecharts
- Windows 10
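Both jieba and pyecharts install straight from pip (something like `pip install jieba pyecharts`), with no C++ build tools involved.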
Steps
My workflow for building a word cloud goes roughly like this:

Extract data => clean => custom dictionary & stop words => segment => inspect => word cloud

In practice you keep going back from the inspection step to the custom-dictionary / stop-word step and repeating the whole loop, because the word lists need continual refinement.

*Stop words are simply the tokens that should be removed. A minimal sketch of this loop is shown below.
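Here is a minimal sketch of that loop on a made-up one-sentence sample; the text and stop words are placeholders, not taken from the real data:

import jieba

# Made-up sample text standing in for the extracted & cleaned data
text = '这个学校很不错,推荐大家可以看看这个学校'

# Placeholder stop words; in practice this list grows after each inspection pass
stopword = [',', '这个', '可以']

# Segment, then count whatever survives the stop-word filter
freq = {}
for w in jieba.cut(text):
    if w not in stopword:
        freq[w] = freq.get(w, 0) + 1

# Inspect the result, then go back and refine the stop-word list
print(sorted(freq.items(), key=lambda x: x[1], reverse=True))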
Code
Core word cloud code
For this step you can follow the official pyecharts documentation. You only need to provide:
- the local path of the background (mask) image
- the local path where the rendered html file should be saved
That's all; there is no need to extract a base64 encoding the way the tutorial does.
htmlpath = r'D:\OneDrive\桌面\test.html'
bgpath = r'D:\OneDrive\桌面\sunlands3.png'

c = (
    WordCloud()
    .add(
        series_name='词云',
        data_pair=frequence,
        mask_image=bgpath
    )
    .render(htmlpath)
)
Note: frequence is my word-frequency variable, shaped like [('学校', 463), ('推荐', 459), ('可以', 429)...]
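If you want a bit more control over the rendering, WordCloud.add in pyecharts v1 exposes a few more parameters (as far as I recall; the values below are illustrative, not from the original post), and a title can be added through set_global_opts:

from pyecharts import options as opts
from pyecharts.charts import WordCloud

c = (
    WordCloud()
    .add(
        series_name='词云',
        data_pair=frequence,        # same [(word, count), ...] list as above
        mask_image=bgpath,          # same background image path as above
        word_size_range=[12, 66],   # min/max font size, illustrative values
        word_gap=20,                # gap between words
        rotate_step=45              # rotation step in degrees
    )
    .set_global_opts(title_opts=opts.TitleOpts(title='词云'))
    .render(htmlpath)
)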
Full code
It follows the steps above. Below is the test code, suited to anyone who likes building the code up step by step; it also runs fine in a notebook.
from Udfs.cls_sql_new import mysql  # my in-house SDK; anything that pulls rows from the database works
from pyecharts.charts import WordCloud
import jieba
import re

# Pull the data from the database
sql = "select discuss from sd_business.tmp_discuss_no_reserve_kaoyan"
res = mysql().exec(sql)

# Strip @mentions with a regular expression
pat = r'@.*\s'
wordlist = [re.sub(pat, '', x[0]) for x in res]

# Custom dictionary
filepath = r'D:\OneDrive\doc\for_share\Python_eco\python\jieba.txt'
jieba.load_userdict(filepath)

# Word segmentation
cutword = [list(jieba.cut(x)) for x in wordlist]

# Stop words
stopword = ['和','哈','哦',',','[',']','?','也','的','么','.','了',' ','啊','呀','呢',':',':','…','~','噢','。','+','!','额','@','-','?','*','丿','—','\n','“','吗','是','有','我','小']

# Word-frequency counting & a second round of cleaning
dic = {}
for c in cutword:
    for w in c:
        if w not in stopword and not w.isdigit() and len(w) > 1:
            if w in dic:
                dic[w] += 1
            else:
                dic[w] = 1

# Turn the dict into a list of tuples
wordfreq = [(k, v) for k, v in dic.items()]

# Sort by frequency, descending
wordfreSort = sorted(wordfreq, key=lambda x: x[1], reverse=True)

# Too many words; keep only the top 300
frequence = wordfreSort[:300]

# Render the word cloud
htmlpath = r'D:\OneDrive\桌面\test.html'
bgpath = r'D:\OneDrive\桌面\sunlands3.png'

c = (
    WordCloud()
    .add(
        series_name='词云',
        data_pair=frequence,
        mask_image=bgpath
    )
    .render(htmlpath)
)
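As an aside, the manual counting loop above can be written more compactly with collections.Counter from the standard library; this is an equivalent alternative, not what the original script uses:

from collections import Counter

# Flatten the segmented comments, apply the same filters as the loop above,
# and let Counter handle the counting and ranking
words = [
    w
    for c in cutword
    for w in c
    if w not in stopword and not w.isdigit() and len(w) > 1
]
frequence = Counter(words).most_common(300)   # same [(word, count), ...] shape as before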
Full code for the packaged class
'''
jieba word-segmentation SDK
'''
from pyecharts.charts import WordCloud
import jieba


class JiebaWordCloud:

    def __init__(self, article: str, userdict: str, stopword: list):
        '''article: the full text as one string'''
        jieba.load_userdict(userdict)
        self.article = article
        self.stopword = stopword

    def wordCount(self):
        '''Segment the text and count word frequencies'''
        dic = {}
        cutword = list(jieba.cut(self.article))
        for w in cutword:
            if w not in self.stopword and not w.isdigit() and len(w) > 1:
                if w in dic:
                    dic[w] += 1
                else:
                    dic[w] = 1
        wordfreq = [(k, v) for k, v in dic.items()]
        wordfreSort = sorted(wordfreq, key=lambda x: x[1], reverse=True)
        self.finalWordList = wordfreSort

    def userWordCloud(self, bgpath, htmlpath):
        '''Render the word cloud'''
        c = (
            WordCloud()
            .add(
                series_name='词云',
                data_pair=self.finalWordList,
                mask_image=bgpath
            )
            .render(htmlpath)
        )
        return htmlpath

    @property
    def returnWordList(self):
        return self.finalWordList


if __name__ == '__main__':
    from Udfs.cls_sql_new import mysql

    htmlpath = r'D:\OneDrive\桌面\test.html'
    bgpath = r'D:\OneDrive\桌面\sunlands3.png'
    stopword = ['和','哈','哦',',','[',']','?','也','的','么','.','了',' ','啊','呀','呢',':',':','…','~','噢','。','+','!','额','@','-','?','*','丿','—','\n','“','吗','是','有','我','小']
    userdict = r'D:\OneDrive\doc\for_share\Python_eco\python\jieba.txt'
    sql = "select discuss from sd_business.tmp_discuss_no_reserve_kaoyan"
    res = ''.join([x[0] for x in mysql().exec(sql)])

    jb = JiebaWordCloud(res, userdict, stopword)
    jb.wordCount()
    htm = jb.userWordCloud(bgpath, htmlpath)
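The __main__ block relies on my in-house mysql SDK, so for reference here is a hedged usage sketch that feeds the class a plain string instead; the file paths, stop words, and user dictionary are placeholders you would swap for your own:

# Hypothetical stand-alone usage: any long Chinese string works as the article
with open(r'D:\some_article.txt', encoding='utf-8') as f:    # placeholder path
    article = f.read()

stopword = [',', '。', '的', '了', '是']    # trimmed-down example list
userdict = r'D:\jieba.txt'                  # placeholder user dictionary

jb = JiebaWordCloud(article, userdict, stopword)
jb.wordCount()
print(jb.returnWordList[:10])                        # top 10 (word, count) pairs
jb.userWordCloud(r'D:\mask.png', r'D:\cloud.html')   # mask image + output html path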
Worked example
Let's take the top article on the WeChat trending list as the example. I have always liked 人物, and I like this piece a lot; I also published a post on my own official account praying for a miracle for this event 😔 To avoid any appearance of riding the hot topic, I won't mention the related keywords here.
- Fetch the article body
This step has to be done line by line: besides pulling the text out with regular expressions, you need to inspect what comes back at every step, which is how the final code below took shape.
import requests
import re


def fetchBody():
    url = 'https://mp.weixin.qq.com/s?__biz=MjEwMzA5NTcyMQ==&mid=2653153430&idx=1&sn=820509fd028bc41c3334bee28e23594f&scene=0'
    res = requests.get(url)
    # Grab every <span> chunk, then strip the remaining tags
    pat = '<span.*>.*?</span>'
    fnd = re.findall(pat, res.text)
    pat = '<.*?>'
    x = [re.sub(pat, '', f) for f in fnd]
    # The third span happens to hold the article body; drop the spaces
    body = x[2].replace(' ', '')
    return body
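A quick sanity check, in the spirit of inspecting the result at every step (assuming the URL above is still reachable and the body is still the third span):

body = fetchBody()
print(len(body))     # rough length check
print(body[:100])    # eyeball the first 100 characters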
- Generate the word cloud
from Udfs.Jieba import JiebaWordCloud

htmlpath = r'D:\OneDrive\桌面\test.html'
bgpath = r'D:\OneDrive\桌面\plane.png'
stopword = ['和','哈','哦',',','[',']','?','也','的','么','.','了',' ','啊','呀','呢',':','一个',':','…','~','噢','。','+','!','额','@','-','?','*','丿','—','\n','“','吗','是','有','我','小']
userdict = r'D:\OneDrive\doc\for_share\Python_eco\python\jieba.txt'

article = fetchBody()
jb = JiebaWordCloud(article, userdict, stopword)
jb.wordCount()
htm = jb.userWordCloud(bgpath, htmlpath)
Rendering usually comes with a small glitch: the html file opens as a blank page at first; refresh the page and the result shows up.
- Result
The rendered result looks fairly flashy, but the keywords are not actually all that prominent, probably because there are too many noise words. To see the actual word frequencies, you still have to go back to the data.
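For instance, one way to do that (using the returnWordList property of the class above; the cutoff of 20 is arbitrary) is to print the head of the frequency list and spot noise words worth adding to the stop-word list:

# Print the 20 most frequent words to find candidates for the stop-word list
for word, count in jb.returnWordList[:20]:
    print(word, count)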
- Hoping for a miracle -