A Complete Course Project
Posted zhoujinpeng
Choose a topic that interests you
First, pick a website to scrape. I chose a mobile-game news site at http://xin.ptbus.com/indiegame/news/
Scrape the relevant data from the web
```python
import requests
from bs4 import BeautifulSoup

url = 'http://xin.ptbus.com/indiegame/news/'
res = requests.get(url)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')

for news in soup.select('li'):
    if len(news.select('.ecst')) > 0:
        title = news.select('.ecst')[0].text
        link = news.select('a')[0]['href']      # detail-page URL (renamed from 'url' to avoid shadowing)
        source = soup.select('span')[0].text
        resd = requests.get(link)
        resd.encoding = 'utf-8'
        soupd = BeautifulSoup(resd.text, 'html.parser')
        pa = soupd.select('.gmIntro')[0].text   # game introduction text on the detail page
        print(title, link, source, pa)
```
The data scraped from the site is shown in the figure below.
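One fragile spot in the extraction above: calls like `news.select('.ecst')[0]` raise an IndexError as soon as an element is missing from a page. A small helper (the name `first_text` is hypothetical, not part of the original code) makes the extraction tolerant of missing tags; a minimal stdlib sketch, with stand-in objects so it runs without BeautifulSoup:

```python
def first_text(elements, default=''):
    """Return the .text of the first element, or a default if the list is empty."""
    return elements[0].text if elements else default

# Stand-in objects mimicking BeautifulSoup tags, so the sketch runs without bs4.
class FakeTag:
    def __init__(self, text):
        self.text = text

print(first_text([FakeTag('Game News')]))     # → Game News
print(first_text([], default='(no title)'))   # → (no title)
```

In the scraper itself this would replace `news.select('.ecst')[0].text` with `first_text(news.select('.ecst'))`, so a malformed list item is skipped instead of crashing the whole run.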
Text analysis and word-cloud generation
Build a word cloud directly from the scraped data.
```python
import requests
from bs4 import BeautifulSoup
import jieba

url = 'http://xin.ptbus.com/indiegame/news/'
res = requests.get(url)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')

for news in soup.select('li'):
    if len(news.select('.ecst')) > 0:
        title = news.select('.ecst')[0].text
        link = news.select('a')[0]['href']
        source = soup.select('span')[0].text
        resd = requests.get(link)
        resd.encoding = 'utf-8'
        soupd = BeautifulSoup(resd.text, 'html.parser')
        pa = soupd.select('.gmIntro')[0].text
        print(title, link, source, pa)

# Tokenize the (last fetched) article introduction and count word frequencies
words = jieba.lcut(pa)
counts = {}
for word in words:
    if len(word) == 1:   # skip single-character tokens (mostly punctuation and particles)
        continue
    counts[word] = counts.get(word, 0) + 1

# Print the ten most frequent words
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{:<5}{:>2}".format(word, count))

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# wordcloud does not support Chinese by default; font_path must point to a Chinese font
cy = WordCloud(font_path='msyh.ttc').generate(pa)
plt.imshow(cy, interpolation='bilinear')
plt.axis("off")
plt.show()
```
The result is shown below. Since this is a mobile-game news site, the word 游戏 (game) appears very frequently, and 黎明危机, a game about to launch, draws a lot of attention.
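The frequency-counting loop above can also be written with the standard library's `collections.Counter`. A minimal sketch, using a whitespace split on English sample text in place of jieba's Chinese segmentation:

```python
from collections import Counter

text = "game news game launch game review news"
words = text.split()
# Keep only words longer than one character, mirroring the len(word) == 1 filter above
counts = Counter(w for w in words if len(w) > 1)

# most_common replaces the manual sort-by-count step
for word, count in counts.most_common(3):
    print("{:<8}{:>2}".format(word, count))
```

`Counter.most_common(n)` does the same job as building `items`, sorting by count, and slicing the top `n`, in one call.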
Finally, scrape several listing pages and save the results to an Excel file and a SQLite database.

```python
import requests
from bs4 import BeautifulSoup
import pandas
import sqlite3

def onepage(pageurl):
    """Return a flat list of (href, title) values from one listing page."""
    res = requests.get(pageurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    newsls = []
    for news in soup.select('li'):
        if len(news.select('.ecst')) > 0:
            newsls.append(news.select('a')[0]['href'])
            newsls.append(news.select('.ecst')[0].text)
    return newsls

newstotal = []
dmurl = 'http://xin.ptbus.com/indiegame/news/'
newstotal.extend(onepage(dmurl))
for i in range(2, 3):   # subsequent listing pages follow the pattern /news/2.html, /news/3.html, ...
    listurl = 'http://xin.ptbus.com/indiegame/news/{}.html'.format(i)
    newstotal.extend(onepage(listurl))

df = pandas.DataFrame(newstotal)
df.to_excel('news.xlsx')
with sqlite3.connect('dmnewsdb.sqlite') as db:
    df.to_sql('dmnewsdb8', con=db)
```
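`df.to_sql` creates the table and inserts the rows in one call; underneath, it is the plain `sqlite3` pattern sketched below. This is a minimal sketch using an in-memory database and a hypothetical table name `news`, not the original `dmnewsdb8` schema:

```python
import sqlite3

# Sample (href, title) rows standing in for the scraped data
rows = [
    ('http://xin.ptbus.com/indiegame/news/1.html', 'Title one'),
    ('http://xin.ptbus.com/indiegame/news/2.html', 'Title two'),
]

with sqlite3.connect(':memory:') as db:
    db.execute('CREATE TABLE news (href TEXT, title TEXT)')
    db.executemany('INSERT INTO news VALUES (?, ?)', rows)
    fetched = db.execute('SELECT title FROM news ORDER BY href').fetchall()

print(fetched)  # → [('Title one',), ('Title two',)]
```

Knowing this pattern helps when reading the data back later, e.g. with `db.execute('SELECT ...')` or `pandas.read_sql` against the same file.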