Web Scraping: BeautifulSoup select()

Posted 2022-11-28

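select() in Beautiful Soup

select() in Beautiful Soup is another kind of filter: it takes a CSS selector string. In my opinion it is a little easier to use than find_all().

find_all() returns a list, and so does select(). Let's explore select() using a site's home page as an example.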
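To make the comparison concrete, here is a minimal sketch on a small inline HTML snippet. The markup and the names `sample` and `soup_demo` are made up for illustration and are not taken from the page scraped in this post. It suggests that `find_all()` and `select()` can locate the same elements and that both results behave like lists.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration
sample = '<ul><li class="item">a</li><li class="item">b</li><li>c</li></ul>'
soup_demo = BeautifulSoup(sample, features="html.parser")

by_find_all = soup_demo.find_all("li", class_="item")  # keyword-style filter
by_select = soup_demo.select("li.item")                # CSS selector

# Both results support len(), indexing and iteration like a list
print(len(by_find_all), len(by_select))
print([li.get_text() for li in by_select])
```

Which one reads better is largely a matter of taste; selectors tend to be shorter once the condition involves nesting.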

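# coding=utf-8
from bs4 import BeautifulSoup
import requests

url = "https://www.cs.net/"
# The "…" in User-Agent is elided in the source; substitute a full string before running
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/61.0",
    "Referer": "https://www.cs.net/",
}

# Pass headers by keyword: the second positional argument of requests.get() is params
html = requests.get(url, headers=headers)
soup = BeautifulSoup(html.text, features="html.parser")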

1. Query by tag name

tag = soup.select("title")
print(tag)

# Output
#[<title>专业IT技术社区</title>]

2. Query by class name: prefix the class name with a dot (.)

class_ = soup.select(".carousel-caption")
print(class_)

# Output
# [<div class="carousel-caption">前端工程师凭什么这么值钱？</div>, 
# <div class="carousel-caption">让面试官颤抖的Tomcat系统架构！</div>, 
# <div class="carousel-caption">上班时间“划水”、下班时间“加班”。钱和命，孰轻孰重？</div>, 
# <div class="carousel-caption"> 面试定心丸：AI知识点备忘录(包括ML、DL、Python、Pandas等）</div>, 
# <div class="carousel-caption">Google发布“多巴胺”开源强化学习框架，三大特性全满足</div>]
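Since the result is a list of tags, the text usually still has to be pulled out of each one. The following is a minimal sketch on hypothetical markup that stands in for the carousel above (the headlines and the `active` class are assumptions, not content from the real page). It also shows that chaining class names without a space requires all of them on the same element.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the carousel on the real page
sample = """
<div class="carousel-caption">First headline</div>
<div class="carousel-caption active"> Second headline </div>
<div class="other">Not a caption</div>
"""
soup_demo = BeautifulSoup(sample, features="html.parser")

# ".carousel-caption" matches any element whose class list contains that name
captions = [div.get_text(strip=True) for div in soup_demo.select(".carousel-caption")]
print(captions)

# Chained class names (no space) must all be present on the same element
active = soup_demo.select(".carousel-caption.active")
print(len(active))
```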

3. Query by id: prefix the id with #

html2 = """<body>
    <p class=""><b>The Dormouse's story</b></p>
    <p class="story">
        <a href="" id="link1">link1</a>
        <a href="" id="link2">link2</a>
        <a href="" id="link3">link3</a>
    </p>
 </body>"""
soup = BeautifulSoup(html2, features="html.parser")
# Named id_ so that the built-in id() is not shadowed
id_ = soup.select("#link1")
print(id_)

# Output
#[<a href="" id="link1">link1</a>]
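An id normally identifies a single element, so a one-element list can be awkward to work with. As a sketch, `select_one()` returns the first matching tag directly, or `None` when nothing matches. The markup and the `/one` and `/two` hrefs below are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the href values are invented for this example
sample = '<p class="story"><a href="/one" id="link1">link1</a><a href="/two" id="link2">link2</a></p>'
soup_demo = BeautifulSoup(sample, features="html.parser")

first = soup_demo.select_one("#link1")    # a single Tag, not a list
missing = soup_demo.select_one("#link9")  # no such id, so this is None

print(first.get_text(), first.get("href"))
print(missing)
```

Checking for `None` before touching attributes avoids an AttributeError when the page layout changes.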

4. Combined query: separate ancestor and descendant selectors with a space

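# soup here is the one parsed from the home page, not the one built from html2
rep = soup.select(".clearfix .list_con .title h2 a")
for link in rep:
    print(link.text, link.get("href"))

# Output: one line per matched link, showing its text followed by its href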
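A space in a selector means "descendant at any depth", while `>` restricts the match to direct children. The sketch below illustrates the difference on hypothetical markup that loosely mirrors the class names used above; the `summary` class and the hrefs are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical markup loosely modeled on the selector used above
sample = """
<div class="list_con">
  <div class="title"><h2><a href="/post/1">Post one</a></h2></div>
  <div class="summary"><a href="/post/1#more">Read more</a></div>
</div>
"""
soup_demo = BeautifulSoup(sample, features="html.parser")

# Space: every <a> beneath .list_con, at any depth
descendants = soup_demo.select(".list_con a")
# ">": only <a> elements whose parent is .summary
children = soup_demo.select(".summary > a")
# No <a> is a direct child of .list_con, so this matches nothing
direct = soup_demo.select(".list_con > a")

for a in descendants:
    print(a.get_text(), a.get("href"))
```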
