Web Scraping - Data Parsing - bs4

Posted 2022-09-08 bigox


1. Data parsing

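Parsing: extracting data from a page according to specified rules.
Purpose: implementing a focused crawler, one that collects only targeted parts of a page rather than the whole page.
Data parsing methods:
- regular expressions
- bs4
- xpath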
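General principle of data parsing:
- Parsing operates on the page source, which is a set of HTML tags.
  - The core role of HTML is to present data, so the data of interest sits either in a tag's text or in its attributes.
- General steps:
  - Locate the tag
  - Extract its text or attributes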
Data parsing with regular expressions

# Task: scrape the images from Qiushibaike
import requests

# UA spoofing: present the request as coming from a regular browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
}

# Method 1: requests
url = 'https://pic.qiushibaike.com/system/pictures/12217/122176396/medium/OM37E794HBL3OFFF.jpg'
img_data = requests.get(url=url, headers=headers).content  # .content returns the response body as bytes
with open('./123.jpg', 'wb') as fp:
    fp.write(img_data)

# Method 2: urllib
from urllib import request
url = 'https://pic.qiushibaike.com/system/pictures/12217/122176396/medium/OM37E794HBL3OFFF.jpg'
request.urlretrieve(url, './456.jpg')
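The listing above downloads a single known image; to apply regular expressions to the task, the image addresses first have to be pulled out of the page source. A minimal sketch of that step follows. The HTML fragment, the `thumb` class name, and the `pic.example.com` host are illustrative assumptions, not the real site's markup.

```python
import re

# Hypothetical page fragment; the real site's markup may differ.
page_text = '''
<div class="thumb">
  <a href="/article/1"><img src="//pic.example.com/a/1.jpg" alt="first"></a>
</div>
<div class="thumb">
  <a href="/article/2"><img src="//pic.example.com/b/2.jpg" alt="second"></a>
</div>
'''

# Non-greedy groups: skip lazily to the <img> inside each "thumb" div
# and capture only the value of its src attribute.
pattern = r'<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'

# re.S lets "." match newlines, since each block spans several lines.
src_list = re.findall(pattern, page_text, re.S)

# Protocol-relative URLs need a scheme before they can be requested.
img_urls = ['https:' + src for src in src_list]
print(img_urls)
```

Each URL in `img_urls` could then be fetched with method 1 above. The pattern is tied to the exact attribute order in the markup, which is one reason bs4 or xpath is often easier to maintain.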


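- Method 2 cannot use UA spoofing as written: urlretrieve accepts no headers argument, so the request goes out with urllib's default User-Agent.

- urllib is an older network request module; before the requests module appeared, requests were sent with urllib.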
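If urllib has to be used anyway, a custom User-Agent can still be sent by wrapping the URL in a `Request` object and opening it with `urlopen`. The sketch below is one way to do that; the `download` helper and the `./789.jpg` path are illustrative names, and the network call is left commented out.

```python
from urllib import request

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}
url = 'https://pic.qiushibaike.com/system/pictures/12217/122176396/medium/OM37E794HBL3OFFF.jpg'

# A Request object carries headers; urlretrieve() itself has no headers parameter.
req = request.Request(url, headers=headers)


def download(req, path):
    """Hypothetical helper: fetch req with urlopen and write the bytes to path."""
    with request.urlopen(req) as resp, open(path, 'wb') as fp:
        fp.write(resp.read())


# download(req, './789.jpg')  # network call, left commented out

# Request stores header names capitalized, hence 'User-agent'.
print(req.get_header('User-agent'))
```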

2. The bs4 parsing module

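Installing the modules:
- pip install bs4 (this pulls in the beautifulsoup4 package)
- pip install lxml
How bs4 parsing works:
- Instantiate a BeautifulSoup object and load the page source to be parsed into it.
- Call the object's attributes and methods to locate tags and extract data.
How to instantiate a BeautifulSoup object:
- BeautifulSoup(fp, 'lxml'): parses an HTML document stored locally, where fp is an open file object.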
```
from bs4 import BeautifulSoup

# Load a locally saved HTML file into a BeautifulSoup object for parsing
with open('./test.html', 'r', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
```
- BeautifulSoup(page_text, 'lxml'): parses page source fetched from the internet, where page_text is the HTML as a string.
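The second form can be tried without any network access by passing a literal string. In this sketch the markup is a made-up stand-in for `requests.get(...).text`, and the parser is Python's built-in `'html.parser'` rather than `'lxml'` so that it runs with bs4 alone; swapping in `'lxml'` should behave the same for this input.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for requests.get(url, headers=headers).text
page_text = '<html><head><title>Demo page</title></head><body><p>hello</p></body></html>'

# 'html.parser' ships with Python; the article uses 'lxml' here instead.
soup = BeautifulSoup(page_text, 'html.parser')

title_text = soup.title.string  # text of the first <title> tag
first_p_text = soup.p.text      # text of the first <p> tag
print(title_text, first_p_text)
```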

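Locating tags:

soup = BeautifulSoup(page_text, 'lxml') instantiates the object.
soup.tagName: locates the first tagName tag in the document and returns a single tag.
Attribute-based location: soup.find('tagName', attrName='value') also returns a single tag. For the class attribute write class_, because class is a Python keyword.
- find_all: used the same way as find, but returns a list of every match.

Selector-based location: select('selector') takes a CSS selector and returns a list.

Supported selectors include tag, class, id, and hierarchy selectors (>: exactly one level down; space: any number of levels down).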

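from bs4 import BeautifulSoup

# Load the page source to be parsed into the soup object
with open('./test.html', 'r', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')

soup.p                                    # first <p> tag
soup.find('div', class_='song')           # first <div class="song">
soup.find_all('div', class_='song')       # every <div class="song">, as a list
soup.select('.tang')                      # class selector
soup.select('#feng')                      # id selector
soup.select('.tang > ul > li')            # child selector, one level per ">"
soup.select('.tang li')                   # descendant selector, any depth
li_6 = soup.select('.tang > ul > li')[6]  # seventh <li> (index 6)
i_tag = li_6.i                            # first <i> inside that <li>
i_tag.string                              # its direct text
soup.find('div', class_='tang').text      # all text inside the div
soup.find('a', id="feng")['href']         # value of the href attribute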
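The calls above depend on a `test.html` file that is not shown. Here is a self-contained sketch with made-up markup that reuses the same class and id names (`song`, `tang`, `feng`), so the results can be checked directly. It uses the built-in `'html.parser'` instead of `'lxml'` so it runs with bs4 alone.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for test.html, which the article does not show.
html = '''
<div class="song"><p>first paragraph</p></div>
<div class="tang">
  <ul>
    <li><a href="http://example.com/1">one</a></li>
    <li><i>two</i></li>
  </ul>
</div>
<a id="feng" href="http://example.com/feng">feng link</a>
'''
soup = BeautifulSoup(html, 'html.parser')

first_p = soup.p.string                      # first <p> in the document
all_divs = soup.find_all('div')              # list of every <div>
child_lis = soup.select('.tang > ul > li')   # child combinator, one level each
nested_lis = soup.select('.tang li')         # descendant combinator, any depth
second_li_text = child_lis[1].i.string       # index into the list, then drill down
feng_href = soup.find('a', id='feng')['href']  # attribute lookup

print(first_p, len(all_divs), len(child_lis), len(nested_lis))
print(second_li_text, feng_href)
```

In this markup the child and descendant selectors match the same two `<li>` tags; they would differ if an `<li>` were nested more deeply than `.tang > ul > li`.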

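Extracting data

Getting text:
- tag.string: the tag's own direct text only. It returns None when the tag contains more than one child node.
- tag.text: all text inside the tag, including the text of nested tags.
Getting attributes:
- tag['attrName']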
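The difference between `string` and `text` is easiest to see on a tag with mixed content. The markup in this sketch is made up for illustration, and the built-in `'html.parser'` is used instead of `'lxml'` so it runs with bs4 alone.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one plain paragraph, one with a nested tag, one link.
html = ('<div><p>only text</p>'
        '<p>mixed <b>bold</b> text</p>'
        '<a href="/ch1.html">chapter 1</a></div>')
soup = BeautifulSoup(html, 'html.parser')
first_p, second_p = soup.find_all('p')

plain_string = first_p.string    # single text child, so the text comes back
mixed_string = second_p.string   # several children, so this is None
mixed_text = second_p.text       # text of the tag and everything nested in it

href = soup.a['href']            # raises KeyError if the attribute is missing
title = soup.a.get('title')      # .get() returns None instead of raising

print(plain_string, mixed_string, mixed_text, href, title)
```

Because `string` can be None, `text` is usually the safer choice when the content of a tag is not known in advance.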

# Scrape the full text of Romance of the Three Kingdoms from http://www.shicimingju.com/book/sanguoyanyi.html
import requests
from bs4 import BeautifulSoup

# headers is the User-Agent dict defined in section 1
url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
page_text = requests.get(url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')
a_list = soup.select('.book-mulu > ul > li > a')  # one <a> per chapter
with open('sanguo.txt', 'w', encoding='utf-8') as fp:
    for a in a_list:
        detail_url = 'http://www.shicimingju.com' + a['href']
        chap_title = a.string
        # Request the chapter detail page and parse the chapter content from it
        detail_page_text = requests.get(detail_url, headers=headers).text
        # Separate name so the soup of the index page is not overwritten
        detail_soup = BeautifulSoup(detail_page_text, 'lxml')
        chap_content = detail_soup.find('div', class_="chapter_content").text
        fp.write(chap_title + ':' + chap_content + '\n')
        print(chap_title, 'scraped successfully!')
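Building `detail_url` by string concatenation works only while every `href` is a path starting with `/`. A sketch of a more tolerant alternative is `urllib.parse.urljoin`, which resolves each link against the page it came from. The `href` values below are made up to show the three common cases; the real page may use only one of them.

```python
from urllib.parse import urljoin

# The index page the links were taken from.
base_url = 'http://www.shicimingju.com/book/sanguoyanyi.html'

# Hypothetical href values: root-relative, page-relative, and absolute.
hrefs = [
    '/book/sanguoyanyi/1.html',
    '2.html',
    'http://www.shicimingju.com/book/sanguoyanyi/3.html',
]

# urljoin resolves each href the way a browser would.
detail_urls = [urljoin(base_url, href) for href in hrefs]
for detail_url in detail_urls:
    print(detail_url)
```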
