40行代码爬取金庸所有武侠小说

Posted 2020-10-18 jclian91

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了40行代码爬取金庸所有武侠小说相关的知识，希望对你有一定的参考价值。

　　我们今天来用Python爬虫爬取金庸所有的武侠小说，网址为：http://jinyong.zuopinj.com/，网页如下：

　　Python代码如下：

 1 # -*- coding: utf-8 -*-
 2 import urllib.request  
 3 from bs4 import BeautifulSoup
 4 
 5 #get book\'s chapter
 6 def get_chapter(url):
 7     html = urllib.request.urlopen(url)  
 8     content = html.read().decode(\'utf8\')
 9     html.close()
10     soup = BeautifulSoup(content, "lxml")
11     title = soup.find(\'h1\').text    #chapter title
12     text = soup.find(\'div\', id=\'htmlContent\')    #chapter content
13     #processing the content to get nice style
14     content = text.get_text(\'\\n\',\'br/\').replace(\'\\n\', \'\\n    \')
15     content = content.replace(\'　　\', \'\\n　　\')
16     return title, \'    \'+content
17         
18 def main():
19     books = [\'射雕英雄传\',\'天龙八部\',\'鹿鼎记\',\'神雕侠侣\',\'笑傲江湖\',\'碧血剑\',\'倚天屠龙记\',\\
20              \'飞狐外传\',\'书剑恩仇录\',\'连城诀\',\'侠客行\',\'越女剑\',\'鸳鸯刀\',\'白马啸西风\',\\
21              \'雪山飞狐\']
22     order = [1,2,3,4,5,6,7,8,10,11,12,14,15,13,9]  #order of books to scrapy
23     #list to store each book\'s scrapying range
24     page_range = [1,43,94,145,185,225,248,289,309,329,341,362,363,364,375,385]
25 
26     for i,book in enumerate(books):
27         for num in range(page_range[i],page_range[i+1]):
28             url = "http://jinyong.zuopinj.com/%s/%s.html"%(order[i],num)
29             try:
30                 title, chapter = get_chapter(url)
31                 with open(\'E://%s.txt\'%book, \'a\', encoding=\'gb18030\') as f:
32                     print(book+\':\'+title+\'-->写入成功！\')
33                     f.write(title+\'\\n\\n\\n\')
34                     f.write(chapter+\'\\n\\n\\n\')
35             except Exception as e:
36                 print(e) 
37     print(\'全部写入完毕!\')
38 
39 main()

　　运行结果如下：

上面的运行结果“HTTP Error 404: Not Found”是因为这个网页不存在，并不影响书本内容的完整性。我们可以去E盘查看文件是否下载成功：

·　　15本书都下载完毕了！整个过程才用了不到10分钟！爬虫的力量真是伟大啊~~

以上是关于40行代码爬取金庸所有武侠小说的主要内容，如果未能解决你的问题，请参考以下文章