8. Crawler training ground: designing the first crawler target pages, single-page crawler cases

Posted 2022-12-26 梦想橡皮擦


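Single-page crawlers

Many people who are new to web scraping start with a single-page collection task. Single-page cases fall into three types:

A single news article
A collection of images
A single-page table

This post implements all three cases, one after another, in the crawler training ground.

First, modify the card designed in the previous post.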

<div class="card-footer text-center ">
  <a href="#" class="card-link text-decoration-none ">Learning blog</a>
  <a href="/general/news" class="card-link text-decoration-none text-success">News page</a>
  <a href="/general/imgs" class="card-link text-decoration-none text-success">Image list</a>
  <a href="/general/table" class="card-link text-decoration-none text-success">Table</a>
</div>

This adds three new case entry points to the card.
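From the crawler's side, these entry links are what a script would follow to reach each case page. The sketch below pulls them out of the card footer with a regular expression. The function name `extract_case_links`, the sample markup, and the English link texts are illustrative assumptions, not part of the training ground's code.

```python
import re

# Matches <a ... href="...">text</a> and captures the href and the link text.
LINK_PATTERN = re.compile(r'<a\s[^>]*href="([^"]+)"[^>]*>(.*?)</a>', re.S)


def extract_case_links(html, prefix="/general/"):
    """Return (href, text) pairs for links whose href starts with prefix."""
    links = []
    for href, text in LINK_PATTERN.findall(html):
        if href.startswith(prefix):
            links.append((href, text.strip()))
    return links


# Sample markup modeled on the card footer above (hypothetical copy).
CARD_FOOTER_HTML = """
<div class="card-footer text-center ">
  <a href="#" class="card-link text-decoration-none ">Learning blog</a>
  <a href="/general/news" class="card-link text-decoration-none text-success">News page</a>
  <a href="/general/imgs" class="card-link text-decoration-none text-success">Image list</a>
  <a href="/general/table" class="card-link text-decoration-none text-success">Table</a>
</div>
"""

case_links = extract_case_links(CARD_FOOTER_HTML)
for href, text in case_links:
    print(href, text)
```

The placeholder link with `href="#"` is skipped because it does not start with the `/general/` prefix.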

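Configuring the case files

Because the crawler training ground covers a large number of target cases, they need to be managed centrally. First create a general.py file in the app directory, then create a general folder in the templates directory, and in it add three files in turn: news.html, imgs.html, and table.html. The resulting project structure is shown below.

Next, create three route functions in general.py. The code is as follows.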

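"""
General crawler cases: controller (route) configuration.
"""
from flask import render_template
from app import app


# Each view function needs a unique name: Flask uses the function name
# as the endpoint, and reusing a name for a different function raises
# an AssertionError when the module is imported.
@app.route('/general/news')
def news():
    return render_template('general/news.html')


@app.route('/general/imgs')
def imgs():
    return render_template('general/imgs.html')


@app.route('/general/table')
def table():
    return render_template('general/table.html')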

For this controller file to take effect, the general.py module also has to be imported in app/__init__.py.

from flask import Flask

app = Flask(__name__)

from app import routes
from app import general

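Restart the project, then visit http://192.168.0.11:8888/general/news in a browser (use your own project's address). If you get the content shown in the figure below, the configuration above is complete.

In news.html, import the Bootstrap files the page needs, then write up a news article. For the full code, see gitcode or pachong.vip; only the final result is shown here.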
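Since the news template itself is not reproduced in this post, the following is only a rough sketch of how a crawler might extract the article with regular expressions. It assumes the page puts the headline in an `<h1>` and the body text in `<p>` tags; the sample markup and the names `extract_news` and `strip_tags` are hypothetical.

```python
import re
from html import unescape


def strip_tags(fragment):
    """Remove inline tags, decode HTML entities, and trim whitespace."""
    return unescape(re.sub(r"<[^>]+>", "", fragment)).strip()


def extract_news(html):
    """Return the headline and non-empty paragraphs of a simple news page."""
    title_match = re.search(r"<h1\b[^>]*>(.*?)</h1>", html, re.S)
    title = strip_tags(title_match.group(1)) if title_match else None
    paragraphs = [strip_tags(p) for p in re.findall(r"<p\b[^>]*>(.*?)</p>", html, re.S)]
    return {"title": title, "paragraphs": [p for p in paragraphs if p]}


# Assumed structure; the real news.html may be laid out differently.
SAMPLE_NEWS_HTML = """
<div class="container">
  <h1 class="text-center">Sample headline</h1>
  <p>First paragraph with <b>bold</b> text.</p>
  <p>Second paragraph &amp; more.</p>
</div>
"""

news = extract_news(SAMPLE_NEWS_HTML)
print(news["title"])
print(news["paragraphs"])
```

Regular expressions are workable here only because the page is small and regular; for messier markup a real HTML parser is usually the safer choice.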

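The second case: images

The second single-page crawler case is an image list. First gather some free image assets and save them to the app/static/images/faces directory.

Then reference them one by one in imgs.html. Sample code is shown below, and the final result follows the code.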

<div class="row">
  <div class="col-sm-6 col-md-3 col-lg-2 p-2">
    <img src="{{ url_for('static', filename='images/faces/3-220P21A447-lp.jpg') }}" alt="" />
  </div>
  <div class="col-sm-6 col-md-3 col-lg-2 p-2">
    <img src="{{ url_for('static', filename='images/faces/3-220P41F009-lp.jpg') }}" alt="" />
  </div>
  <div class="col-sm-6 col-md-3 col-lg-2 p-2">
    <img src="{{ url_for('static', filename='images/faces/3-220P2155220-50-lp.jpg') }}" alt="" />
  </div>
  <div class="col-sm-6 col-md-3 col-lg-2 p-2">
    <img src="{{ url_for('static', filename='images/faces/3-220P2160142-lp (1).jpg') }}" alt="" />
  </div>
  <div class="col-sm-6 col-md-3 col-lg-2 p-2">
    <img src="{{ url_for('static', filename='images/faces/3-220P5151547-lp.jpg') }}" alt="" />
  </div>
  <div class="col-sm-6 col-md-3 col-lg-2 p-2">
    <img src="{{ url_for('static', filename='images/faces/7-220I1094505-lp.jpg') }}" alt="" />
  </div>
</div>
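To show what a crawler would do with this page, here is a minimal sketch that collects every image address from the rendered HTML using only the standard library. It assumes the rendered `src` values are site-relative paths under `/static/` (Flask's default static URL path); the class name `ImgSrcParser`, the function `extract_image_urls`, and the sample markup are illustrative.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <img ... /> tags are also routed here by HTMLParser.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_image_urls(html, base_url):
    """Return absolute image URLs found in html, resolved against base_url."""
    parser = ImgSrcParser()
    parser.feed(html)
    return [urljoin(base_url, src) for src in parser.sources]


# Assumed rendered output of imgs.html (shortened to two images).
SAMPLE_IMGS_HTML = """
<div class="row">
  <div class="col-sm-6 col-md-3 col-lg-2 p-2">
    <img src="/static/images/faces/3-220P21A447-lp.jpg" alt="" />
  </div>
  <div class="col-sm-6 col-md-3 col-lg-2 p-2">
    <img src="/static/images/faces/3-220P41F009-lp.jpg" alt="" />
  </div>
</div>
"""

image_urls = extract_image_urls(SAMPLE_IMGS_HTML, "http://192.168.0.11:8888/general/imgs")
for url in image_urls:
    print(url)
```

Resolving each `src` against the page address with `urljoin` gives download-ready URLs regardless of whether the page uses relative or absolute paths.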

Single-page table

A page that contains only one table is fairly easy to implement. The result is as follows.

<table class="table table-striped table-hover" cellpadding="0" cellspacing="0">
  <tbody>
    <tr>
      <td style="width:10%;">No.</td>
      <td style="width:35%;">School</td>
      <td style="width:15%;">Undergraduate / junior college</td>
      <td style="width:15%;">Public / private</td>
      <td style="width:25%;">Details</td>
    </tr>
    <tr>
      <td>1</td>
      <td>Peking University</td>
      <td>Undergraduate</td>
      <td>Public</td>
      <td><a href="/school/北京大学.html" target="_blank">Visit homepage</a></td>
    </tr>
  </tbody>
</table>
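A crawler targeting this page would typically turn the table into rows of records. The sketch below does that with the standard library's `html.parser`, treating the first row as the header because the table uses `<td>` cells throughout. The names `TableParser` and `parse_table` and the shortened English sample data are assumptions for illustration.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags such as <a> is still part of the cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableParser()
    parser.feed(html)
    return parser.rows


SAMPLE_TABLE_HTML = """
<table class="table table-striped table-hover">
  <tbody>
    <tr><td>No.</td><td>School</td><td>Details</td></tr>
    <tr><td>1</td><td>Peking University</td>
        <td><a href="/school/北京大学.html" target="_blank">Visit homepage</a></td></tr>
  </tbody>
</table>
"""

rows = parse_table(SAMPLE_TABLE_HTML)
header, body = rows[0], rows[1:]
records = [dict(zip(header, row)) for row in body]
print(records)
```

Pairing each data row with the header row via `zip` yields dictionaries that are easy to write out as CSV or JSON.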

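With the three simplest cases complete, the next step is to fix a home page style bug found while building them.

Home page improvements

While writing the cases for this post, I found that when the home page switches to a small screen, the style bug shown in the figure below appears.

The cause is that the card height was fixed, so the following steps fix that.

Remove the fixed height

Simply delete the style="height:268px;" setting from the card tag. However, the cards then end up with inconsistent heights.

To fix that, add min-height and min-width styles and switch the page's grid layout to auto-sizing columns. The code is as follows.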

<div class="col mt-2">
  <div class="card  border-secondary rounded-5 shadow-sm" style="min-height:268px;min-width:300px;" >
    <div class="card-header text-center">
      <h4 class="card-title">Single-page crawlers</h4>
    </div>
    <div class="card-body">
      <p class="card-text">
        The target data is presented on a single page and can be collected directly with the simplest crawler libraries; regular expressions are usually enough to extract the data.
      </p>
      <p class="card-text text-left">Difficulty: ⭐</p>
      <p class="card-text">
        Cases:
        <a href="/general/news" class="card-link text-success">News page</a>
        <a href="/general/imgs" class="card-link text-success">Image list</a>
        <a href="/general/table" class="card-link text-success">Table</a>
      </p>
    </div>
    <div class="card-footer text-end">
      <a href="#" class="btn btn-primary card-link ">Learning blog</a>
    </div>
  </div>
</div>

The result at this point is shown below.

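📢📢📢📢📢📢
💗 You are reading the blog of 【梦想橡皮擦】
👍 If you have finished reading, feel free to give it a like
🌻 If you spot an error, point it out in the comments
📆 This is 橡皮擦's 802nd original blog post

Cases are guaranteed to be updated for 5 years from the date of subscription.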
