python3生成标签云
Posted TTyb
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了python3生成标签云相关的知识,希望对你有一定的参考价值。
标签云是现在大数据里面最喜欢使用的一种展现方式,其中在python3下也能实现标签云的效果,贴图如下:
-------------------进入正文---------------------
首先要安装以下几个库:
1 #!/usr/bin/python3.4 2 # -*- coding: utf-8 -*- 3 4 # http://www.lfd.uci.edu/~gohlke/pythonlibs/#cx_freeze 5 # 万能仓库下载pygame 6 # pip3下载simplejson
还有最重要的库:
pip3 install pytagcloud
或者去官网下载:
https://pypi.python.org/pypi/pytagcloud/
安装完毕,利用官网的例子来做:
1 from pytagcloud import create_tag_image, make_tags 2 from pytagcloud.lang.counter import get_tag_counts 3 4 YOUR_TEXT = "A tag cloud is a visual representation for text data, typically\\ 5 used to depict keyword metadata on websites, or to visualize free form text." 6 7 tags = make_tags(get_tag_counts(YOUR_TEXT), maxsize=120) 8 9 create_tag_image(tags, \'cloud_large.png\', size=(900, 600), fontname=\'Lobster\')
果断报错:
Traceback (most recent call last): File "D:/code/pythonwork/Text.py", line 96, in <module> tags = make_tags(get_tag_counts(YOUR_TEXT), maxsize=120) File "C:\\Python34\\lib\\site-packages\\pytagcloud\\lang\\counter.py", line 25, in get_tag_counts return sorted(counted.iteritems(), key=itemgetter(1), reverse=True) AttributeError: \'dict\' object has no attribute \'iteritems\'
看了发现问题出在库中的:
# counter.py return sorted(counted.iteritems(), key=itemgetter(1), reverse=True)
原来是python3.4不支持写法:
在Python2.x中,items( )用于 返回一个字典的拷贝列表【Returns a copy of the list of all items (key/value pairs) in D】,占额外的内存。
iteritems() 用于返回本身字典列表操作后的迭代【Returns an iterator on all items(key/value pairs) in D】,不占用额外的内存。
Python 3.x 里面,iteritems() 和 viewitems() 这两个方法都已经废除了,而 items() 得到的结果是和 2.x 里面 viewitems() 一致的。在3.x 里 用 items()替换iteritems() ,可以用于 for 来循环遍历。
但是当我换成:
# counter.py return sorted(counted.items(), key=itemgetter(1), reverse=True)
发现运行并没有错误,但是没有生成标签云啊,一遍一遍打印出来,终于找到问题了:
from pytagcloud import create_tag_image
这个是为了生成一个元组的东西:
# counts =[(\'cloud\', 3), # (\'words\', 2), # (\'code\', 1), # (\'word\', 1), # (\'appear\', 1)]
但是python3里面的items()是达不到这个效果的,所以我就自己写吧。
读取txt文件,将每一行都按照空格划分成一个个数组的元素:
1 arr = [] 2 file = open(\'../tagcloud/tag_file.txt\', \'r\') 3 data = file.read().split(\'\\r\\n\') 4 for content in data: 5 contents = validatecontent(content).split() 6 for word in contents: 7 arr.append(word)
[\'BAISC\', \'Python\', \'BASICA\', \'GVBASIC\', \'GWBASIC\', \'Python\', \'ETBASIC\', \'QBASIC\', \'Quick\', \'Basic\', \'Turbo\', \'Basic\', \'True\', \'Python\', \'java\', \'Basic\', \'Visual\', \'Basic\', \'Visual\', \'Basic\', \'Net\', \'Power\', \'Basic\', \'Python\', \'java\', \'SQL\', \'VB\', \'Small\', \'Basic\', \'Free\', \'Basic\', \'DarkBASIC\', \'VBScript\', \'Visual\', \'Basic\', \'For\', \'ApplicationsVBA\', \'REALbasic\', \'C\', \'C\', \'Turbo\', \'C\', \'Python\', \'java\', \'SQL\', \'VB\', \'php\', \'html\', \'Borland\', \'C\', \'C\', \'Builder\', \'CCLI\', \'Python\', \'java\', \'ObjectiveC\', \'C#\', \'Microsoft\', \'Visual\', \'C\', \'Pascal\', \'Delphi\', \'Turbo\', \'Python\', \'java\', \'SQL\', \'VB\', \'PHP\', \'HTML\', \'Pascal\', \'Object\', \'Pascal\', \'Free\', \'Pascal\', \'Lazarus\', \'FORTRAN\', \'MATLAB\', \'Scilab\', \'GNU\', \'Octave\', \'R\', \'SPlus\', \'Mathematica\', \'Maple\', \'Python\', \'java\', \'SQL\', \'VB\', \'PHP\', \'HTML\', \'Julia\', \'xBaseClipper\', \'Visual\', \'FoxPro\', \'SQLPLSQL\', \'TSQL\', \'SQLPSM\', \'LINQ\', \'Xquer\', \'Lua\', \'Python\', \'java\', \'SQL\', \'VB\', \'Perl\', \'PHP\', \'Python\', \'Ruby\', \'ASP\', \'JSP\', \'TclTk\', \'VBScript\', \'AppleScript\', \'AAuto\', \'ActionScript\', \'DMDScript\', \'ECMAScript\', \'javascript\', \'JScript\', \'TypeScript\', \'sh\', \'bash\', \'Python\', \'java\', \'SQL\', \'VB\', \'PHP\', \'HTML\', \'sed\', \'awk\', \'PowerShell\', \'csh\', \'tcsh\', \'ksh\', \'zsh\', \'XMLSVG\', \'XML\', \'Schema\', \'Python\', \'java\', \'XSLT\', \'XHTML\', \'MathML\', \'XAML\', \'SSML\', \'SGML\', \'HTML\', \'Python\', \'java\', \'SQL\', \'VB\', \'Curl\', \'SVG\', \'XML\', \'Schema\', \'XSLT\', \'XHTML\', \'MathML\', \'XAML\', \'SSML\', \'Java\', \'Jython\', \'JRuby\', \'JScheme\', \'Groovy\', \'Kawa\', \'Scala\', \'Clojure\', \'ALGOL\', \'APLJ\', \'Ada\', \'Falcon\', \'Forth\', \'Io\', \'MUMPS\', \'PLI\', \'PostScript\', \'REXX\', \'SAC\', \'Self\', \'Simula\', \'Swift\', \'IronPython\', \'IronRuby\', \'COBOL\', \'Python\', \'java\', \'SQL\', \'VB\', \'PHP\', \'HTML\']
其中validatecontent是起初非法字符的函数:
1 # 去除内容中的非法字符 (Windows) 2 def validatecontent(content): 3 # \'/\\:*?"<>|\' 4 rstr = r"[\\/\\\\\\:\\*\\?\\"\\<\\>\\|\\.\\*\\+\\-\\(\\)\\"\\\'\\(\\)\\!\\?\\“\\”\\,\\。\\;\\:\\{\\}\\{\\}\\=\\%\\*\\~\\·]" 5 new_content = re.sub(rstr, "", content) 6 return new_content
对每一个元素都来个计数:
from collections import Counter counts = Counter(arr).items() print(counts)
效果出来了:
dict_items([(\'For\', 1), (\'SQL\', 8), (\'JRuby\', 1), (\'Builder\', 1), (\'HTML\', 6), (\'LINQ\', 1), (\'BAISC\', 1), (\'BASICA\', 1), (\'PHP\', 6), (\'Octave\', 1), (\'csh\', 1), (\'PostScript\', 1), (\'awk\', 1), (\'Ruby\', 1), (\'AppleScript\', 1), (\'Object\', 1), (\'java\', 11), (\'TclTk\', 1), (\'Xquer\', 1), (\'ksh\', 1), (\'zsh\', 1), (\'ETBASIC\', 1), (\'AAuto\', 1), (\'Borland\', 1), (\'SVG\', 1), (\'Jython\', 1), (\'Simula\', 1), (\'IronPython\', 1), (\'Python\', 14), (\'Microsoft\', 1), (\'ActionScript\', 1), (\'XHTML\', 2), (\'REXX\', 1), (\'COBOL\', 1), (\'Scilab\', 1), (\'Ada\', 1), (\'Basic\', 9), (\'GVBASIC\', 1), (\'ECMAScript\', 1), (\'TypeScript\', 1), (\'Falcon\', 1), (\'Clojure\', 1), (\'ASP\', 1), (\'ALGOL\', 1), (\'XMLSVG\', 1), (\'GWBASIC\', 1), (\'VBScript\', 2), (\'CCLI\', 1), (\'Lazarus\', 1), (\'Julia\', 1), (\'JSP\', 1), (\'PowerShell\', 1), (\'IronRuby\', 1), (\'Power\', 1), (\'FORTRAN\', 1), (\'Self\', 1), (\'Perl\', 1), (\'Small\', 1), (\'FoxPro\', 1), (\'REALbasic\', 1), (\'GNU\', 1), (\'Mathematica\', 1), (\'True\', 1), (\'Visual\', 5), (\'JScheme\', 1), (\'Maple\', 1), (\'Quick\', 1), (\'Turbo\', 3), (\'SAC\', 1), (\'JScript\', 1), (\'APLJ\', 1), (\'sh\', 1), (\'Kawa\', 1), (\'Pascal\', 4), (\'TSQL\', 1), (\'SPlus\', 1), (\'C\', 6), (\'xBaseClipper\', 1), (\'tcsh\', 1), (\'SQLPSM\', 1), (\'ApplicationsVBA\', 1), (\'SSML\', 2), (\'R\', 1), (\'Groovy\', 1), (\'XSLT\', 2), (\'MUMPS\', 1), (\'bash\', 1), (\'DarkBASIC\', 1), (\'SGML\', 1), (\'XAML\', 2), (\'VB\', 8), (\'Curl\', 1), (\'Schema\', 2), (\'MATLAB\', 1), (\'MathML\', 2), (\'Lua\', 1), (\'Net\', 1), (\'ObjectiveC\', 1), (\'JavaScript\', 1), (\'Java\', 1), (\'Io\', 1), (\'Free\', 2), (\'Delphi\', 1), (\'sed\', 1), (\'XML\', 2), (\'Forth\', 1), (\'C#\', 1), (\'SQLPLSQL\', 1), (\'QBASIC\', 1), (\'DMDScript\', 1), (\'Swift\', 1), (\'Scala\', 1), (\'PLI\', 1)])
最后直接代入进去就行了:
1 tags = make_tags(counts, maxsize=120) 2 3 create_tag_image(tags, \'cloud_large.png\', size=(900, 600), fontname=\'Lobster\')
具体的修正需要自己慢慢去琢磨了,比如文字大小、图片大小、背景颜色等等。
到这里标签云是算完成了的,但是却是不支持中文,原因是没有合适的ttf字体文件,准备一个 ttf 中文字体,如MicrosoftYaHei.ttf ,将其移动到
# C:\\Python34\\Lib\\site-packages\\pytagcloud\\fonts
接着就是更改fonts.json文件,按照样式添加类似于css的东西:
{ "name": "MicrosoftYaHei", "ttf": "MicrosoftYaHei.ttf", "web": "none" }
注意前后的逗号就行。最后将这里的代码改一下:
create_tag_image(tags, \'cloud_large.png\', size=(900, 600), fontname=\'MicrosoftYaHei\')
运行,搞定!中文效果图:
我的在github里面,可以去下载看看。
以上是关于python3生成标签云的主要内容,如果未能解决你的问题,请参考以下文章