A simple Python scraper for the novel 《天龙八部》 (Demi-Gods and Semi-Devils), served as local web pages

Posted 2020-10-08


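Foreword: this is my first time writing a scraper, and it barely even counts as one. My skills are limited, so this post is mainly a learning record.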

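The main workflow is as follows:

Use Python's requests module to fetch the pages

Use the re module (regular expressions) to extract the needed content (chapter titles and body text)

Use the MySQLdb module to store the content in a database

Use the web.py module to serve the pages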
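
The extraction step can be tried offline before touching the network. The sketch below runs a non-greedy link pattern over a hard-coded string. The sample markup and the name `extract_chapters` are made up for illustration; the real index page may be structured differently.

```python
import re

# Hypothetical sample markup; the real index page may be structured differently.
SAMPLE_INDEX = (
    '<ul>'
    '<li><a href="/tian/1.html">Chapter 1</a></li>'
    '<li><a href="/tian/2.html">Chapter 2</a></li>'
    '</ul>'
)

# Non-greedy groups capture the link target and the link text of each list item.
LINK_PATTERN = re.compile(r'<li><a href="(.*?)">(.*?)</a>')


def extract_chapters(html):
    """Return a list of (href, title) tuples found in the given HTML string."""
    return LINK_PATTERN.findall(html)


chapters = extract_chapters(SAMPLE_INDEX)
for href, title in chapters:
    print(href, title)
```

Because the groups are non-greedy, each match stops at the first closing quote and the first `</a>`, so neighbouring list items are not merged into one match.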

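The result is shown below. The page implements simple navigation through previous-page and next-page links:

[Figure: screenshot of the rendered chapter page]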

The directory structure is as follows:

D:\PROJECT\SPIDER
│  fiction_spider.py
│  webapp.py
│
└─template
        index.html

The code that scrapes the content and stores it in the database:

# coding: utf-8
# fiction_spider.py
import re

import MySQLdb
import requests


def get_title():
    """Return (relative_url, chapter_title) tuples from the index page."""
    # Assumes the site serves UTF-8, matching the utf8 database charset below
    html = requests.get('http://www.jinyongwang.com/tian/').content.decode('utf-8')
    rem = r'<li><a href="(.*?)">(.*?)</a>'
    return re.findall(rem, html)


def get_content(url):
    """Return the first <p>...</p> run that is directly followed by a <script> tag."""
    html = requests.get('http://www.jinyongwang.com/' + url).content.decode('utf-8')
    matchs_p = r'<p>(.*?)</p><script.*?'
    data = re.findall(matchs_p, html)
    return data[0]


if __name__ == '__main__':
    a = MySQLdb.connect(host='10.1.*.*', port=3306, user='user', passwd='passwd', db='testdb', charset='utf8')
    for i in get_title():
        cur = a.cursor()
        print(i[1])
        print(i[0])
        # Bound parameters let the driver escape any quotes in the chapter text
        sqli = 'INSERT INTO `fiction` (`title`, `content`) VALUES (%s, %s)'
        cur.execute(sqli, (i[1], get_content(i[0])))
        cur.close()
        a.commit()
    a.close()
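
The post does not show the definition of the `fiction` table. The sketch below assumes an auto-incrementing `id` plus `title` and `content` columns, since the page code looks chapters up by `id`. It uses the standard-library `sqlite3` module as a stand-in for MySQL so that it runs without a server; the schema and the helper name `save_chapter` are assumptions, not the author's setup. Note that `sqlite3` uses `?` placeholders where MySQLdb uses `%s`.

```python
import sqlite3

# Assumed schema: the article does not show the real table definition.
SCHEMA = """
CREATE TABLE fiction (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    content TEXT NOT NULL
)
"""


def save_chapter(conn, title, content):
    """Insert one chapter with a parameterized query and return its new id."""
    cur = conn.execute(
        'INSERT INTO fiction (title, content) VALUES (?, ?)',
        (title, content),
    )
    conn.commit()
    return cur.lastrowid


conn = sqlite3.connect(':memory:')  # in-memory database, discarded on exit
conn.execute(SCHEMA)
# Quotes in the text are safe because the values are bound, not interpolated.
first_id = save_chapter(conn, 'Chapter 1', 'He said "hello" and left.')
row = conn.execute(
    'SELECT title, content FROM fiction WHERE id = ?', (first_id,)
).fetchone()
print(first_id, row[0])
```

Binding values this way matters for scraped text in particular, because chapter bodies routinely contain quotation marks that would break a query assembled by string formatting.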

The web application code:

# coding: utf-8
# webapp.py
import re

import web

urls = ('/(.*)', 'Index')

db = web.database(dbn='mysql', host='10.1.*.*', port=3306, user='user', passwd='passwd', db='testdb', charset='utf8')

render = web.template.render('template')


class Index:
    def GET(self, html):
        # Only paths such as "12.html" are chapter pages; anything else is a 404
        match = re.match(r'(\d+)\.html$', html)
        if match is None:
            raise web.notfound()
        id = match.group(1)
        print(id)
        # $id is bound by web.py instead of being pasted into the SQL string
        data = db.query("select * from fiction where id=$id", vars={'id': int(id)})
        return render.index(data[0], id)


if __name__ == '__main__':
    web.application(urls, globals()).run()
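
The URL pattern `/(.*)` hands the handler everything after the leading slash, so requests such as `favicon.ico` reach it as well. A minimal sketch of validating the path before it gets anywhere near a query follows; `parse_chapter_id` is a hypothetical helper, not part of web.py.

```python
import re

# Accept only paths of the form "<digits>.html", for example "12.html".
CHAPTER_PATH = re.compile(r'^(\d+)\.html$')


def parse_chapter_id(path):
    """Return the chapter id as an int, or None if the path is not a chapter page."""
    match = CHAPTER_PATH.match(path)
    if match is None:
        return None
    return int(match.group(1))


for path in ('12.html', 'favicon.ico', '1 OR 1=1.html'):
    print(repr(path), '->', parse_chapter_id(path))
```

Returning `None` for anything that is not a plain number lets the caller answer with a 404 instead of failing inside the database call.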

The index.html template used to render each page:

$def with(data,s)
<meta charset="utf-8"/>
<title>$:data.title</title>
<h1>$:data.title</h1>
<div style="margin:0px auto;text-align:center;">
<a href="$:(int(s)-1).html">Previous page</a>
<a href="$:(int(s)+1).html">Next page</a>
</div>
$:data.content
<br>
<div style="margin:0px auto;text-align:center;">
<a href="$:(int(s)-1).html">Previous page</a>
<a href="$:(int(s)+1).html">Next page</a>
</div>
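
The template computes its links as `int(s)-1` and `int(s)+1` unconditionally, so the first and last chapters link to pages that do not exist. One possible refinement is to compute the links in Python and omit the missing ones. The sketch below assumes chapter ids run from 1 to a known total without gaps; `neighbour_links` and `total` are hypothetical names.

```python
def neighbour_links(current, total):
    """Return (prev_href, next_href); an entry is None when no such page exists.

    Assumes chapter ids run from 1 to total without gaps.
    """
    prev_href = '%d.html' % (current - 1) if current > 1 else None
    next_href = '%d.html' % (current + 1) if current < total else None
    return prev_href, next_href


print(neighbour_links(1, 50))
print(neighbour_links(25, 50))
print(neighbour_links(50, 50))
```

The handler could pass both values to the template, which would then render a link only when the corresponding value is not `None`.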
