Web Scraper Assignment

Posted by 宁缺-勿滥


Scraping the recommended-book blurbs and recommendation links from Douban Books (豆瓣读书):

  

from bs4 import BeautifulSoup
import requests
import jieba
import time
import datetime

# Fetch the Douban Books homepage
r = requests.get('https://book.douban.com')
html = r.text

soup = BeautifulSoup(html, 'html.parser')

# Collect the text of every link in the global navigation bar
items = []
global_nav_items = soup.find('div', class_='global-nav-items')

for tag in global_nav_items.find_all('a'):
    items.append(tag.string)

print(items)
# Define a class to hold one book's data
class Info(object):
    def __init__(self, title, img, link, author, year, publisher, abstract):
        self.title = title
        self.img = img
        self.link = link
        self.author = author
        self.year = year
        self.publisher = publisher
        self.abstract = abstract

# The "new books" list on the homepage
new_book_html = soup.find('ul', class_='list-col list-col5 list-express slide-item')

book_info_list = []

# Extract title, cover image, link, author, year, publisher and abstract
# from each list item
for tag in new_book_html.find_all('li'):
    info_html = tag.find('div', class_='info')
    info_title = info_html.find('a')
    title = info_title.string.strip()
    cover = tag.find('div', class_='cover')
    img = cover.find('img')['src'].strip()
    href = info_title['href'].strip()
    author = info_html.find(class_='author').string.strip()
    year = info_html.find(class_='year').string.strip()
    publisher = info_html.find(class_='publisher').string.strip()
    abstract = info_html.find(class_='abstract').string.strip()
    book = Info(title, img, href, author, year, publisher, abstract)
    book_info_list.append(book)

print('推荐%s本新书' % len(book_info_list))
for book in book_info_list:
    print('*' * 100)
    print(book.title)
    print(book.img)
    print(book.link)
    print(book.author)
    print(book.year)
    print(book.publisher)
    print(book.abstract)

  

Saving the scraped data to a Markdown file:

def save():
    # Build a dated file name, e.g. 豆瓣2018-04-30推荐书单.md
    today = datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d')
    file_name = '豆瓣' + today + '推荐书单'
    with open(file_name + '.md', 'w') as file:
        file.write('#' + file_name)
        file.write('\n---')
    # Append one numbered section per book
    with open(file_name + '.md', 'a') as file:
        num = 1
        for book in book_info_list:
            file.write('\n\n')
            file.write('## ' + str(num) + '. ' + book.title)
            file.write('\n')
            file.write('![' + book.title + ' cover img](' + book.img + ')')
            file.write('\n\n')
            file.write('简介\n')
            file.write('---\n')
            file.write(book.abstract)
            file.write('\n\n')
            file.write('作者:     ' + book.author + '\n\n')
            file.write('出版时间: ' + book.year + '\n\n')
            file.write('出版社:   ' + book.publisher + '\n\n')
            file.write('[更多...](' + book.link + ')')
            num = num + 1

if __name__ == '__main__':
    save()

  

Screenshot:

Word-cloud screenshot:
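The script imports jieba but never calls it; the word cloud itself was produced with an online generator. As a sketch of the missing step, the abstracts could be tokenized with jieba and counted, and the resulting word/frequency pairs pasted into the generator. The sample text and the one-character cut-off below are illustrative, not taken from the original script:

import jieba
from collections import Counter

# Tokenize a sample abstract with jieba and count word frequencies;
# the (word, count) pairs can feed an online word-cloud generator.
abstract = '本书介绍了爬虫的基本原理,以及爬虫在数据分析中的应用。'

words = [w for w in jieba.cut(abstract) if len(w) > 1]  # drop 1-char tokens and punctuation
freq = Counter(words)

for word, count in freq.most_common(10):
    print(word, count)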

 

 

Issues encountered:

1.在电脑无法安装词云wordcloud,将代码复制在在线词云生成器进行词云生成;

2. The data could not be saved directly to a plain-text file, so a Markdown file was used for storage instead;

3. Running the script produced an error:

Traceback (most recent call last):
  File "C:/Users/Administrator/PycharmProjects/whr/suy.py", line 116, in <module>
    save()
  File "C:/Users/Administrator/PycharmProjects/whr/suy.py", line 109, in save
    file.write('作者:     ' + book.author + '\n\n')
UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position 19: illegal multibyte sequence

I have not yet found a solution to this.
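A likely cause, offered here as an assumption since the post leaves the error unresolved: on Windows, open() without an encoding argument falls back to the system code page (GBK), which has no mapping for the non-breaking space U+00A0 that appears in text scraped from Douban. A minimal sketch of two common workarounds, forcing UTF-8 output and replacing the NBSP; the sample string and file name are illustrative:

# Sketch of a fix for the GBK UnicodeEncodeError: write UTF-8 output and
# normalize the non-breaking space (U+00A0) found in scraped text.
author = '作者:\xa0某某'              # sample scraped string containing U+00A0

clean = author.replace('\xa0', ' ')  # NBSP -> ordinary space

# encoding='utf-8' bypasses the platform-default GBK codec on Windows
with open('demo.md', 'w', encoding='utf-8') as file:
    file.write(clean + '\n\n')

Either change alone is usually enough; passing encoding='utf-8' to every open() call in save() is the smaller edit.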

 
