Best way to extract specific parts from html / json page?
Posted: 2020-09-12 00:40:57
I have the following returned from a Python requests call:
"error":"ErrorMessage":"
<div>
<p>To protect your privacy, this form will not display details such as a clinical or assisted collection. If you believe that the information detailed above is incomplete or incorrect, please tell us here
<a href=\\"http:\\/\\/www.southhams.gov.uk\\/wastequestion\\">www.southhams.gov.uk\\/wastequestion<\\/a><\\/p><\\/div>","CodeName":"Success","ErrorStatus":0,"calendar":"calendar":"
<div class=\\"wsResponse\\">To protect your privacy, this form will not display details such as a clinical or assisted collection. If you believe that the information detailed above is incomplete or incorrect, please tell us here
<a href=\\"http:\\/\\/www.southhams.gov.uk\\/wastequestion\\">www.southhams.gov.uk\\/wastequestion<\\/a><\\/div>","binCollections":"tile":[["
<div class=\'collectionDiv\'>
<div class=\'fullwidth\'>
<h3>Organic Collection Service (Brown Organic Bin)<\\/h3><\\/div>
<div class=\\"collectionImg\\">
<img src=\\"https:\\/\\/southhams.fccenvironment.co.uk\\/library\\/images\\/brown bin.png\\" \\/><\\/div>\\n
<div class=\'wdshDetWrap\'>Your brown organic bin collection is
<b>Fortnightly<\\/b> on a
<b>Thursday<\\/b>.
<br\\/> \\n Your next scheduled collection is
<b>Friday, 29 May 2020<\\/b>.
<br\\/>
<br\\/>
<a href=\\"https:\\/\\/www.southhams.gov.uk\\/article\\/3427\\">Read more about the Organic Collection Service ><\\/a><\\/div><\\/div>"],["
<div class=\'collectionDiv\'>
<div class=\'fullwidth\'>
<h3>Recycling Collection Service (Recycling Sacks)<\\/h3><\\/div>
<div class=\\"collectionImg\\">
<img src=\\"https:\\/\\/southhams.fccenvironment.co.uk\\/library\\/images\\/SH_two_rec_sacks.png\\" \\/><\\/div>\\n
<div class=\'wdshDetWrap\'>Your recycling sacks collection is
<b>Fortnightly<\\/b> on a
<b>Thursday<\\/b>.
<br\\/> \\n Your next scheduled collection is
<b>Friday, 29 May 2020<\\/b>.
<br\\/>
<br\\/>
<a href=\\"https:\\/\\/www.southhams.gov.uk\\/article\\/3383\\">Read more about the Recycling Collection Service ><\\/a><\\/div><\\/div>"],["
<div class=\'collectionDiv\'>
<div class=\'fullwidth\'>
<h3>Refuse Collection Service (Grey Refuse Bin)<\\/h3><\\/div>
<div class=\\"collectionImg\\">
<img src=\\"https:\\/\\/southhams.fccenvironment.co.uk\\/library\\/images\\/grey bin.png\\" \\/><\\/div>\\n
<div class=\'wdshDetWrap\'>Your grey refuse bin collection is
<b>Fortnightly<\\/b> on a
<b>Thursday<\\/b>.
<br\\/> \\n Your next scheduled collection is
<b>Thursday, 04 June 2020<\\/b>.
<br\\/>
<br\\/>
<a href=\\"https:\\/\\/www.southhams.gov.uk\\/article\\/3384\\">Read more about the Refuse Collection Service ><\\/a><\\/div><\\/div>"]]
I want to extract the following for each of the three collectionDiv blocks:
Organic Collection Service (Brown Organic Bin) Friday, 29 May 2020
Recycling Collection Service (Recycling Sacks) Friday, 29 May 2020
Refuse Collection Service (Grey Refuse Bin) Thursday, 04 June 2020
So far I have tried loading response.content into Python's json handler but still could not extract the data, so I tried BeautifulSoup with soup.find_all("div", class_="wdshDetWrap"), but still could not pull out exactly the output I need. Would lxml or a similar approach be simpler?
Thanks for looking.
Requests code:
import requests

url = "https://southhams.fccenvironment.co.uk/mycollections"
response = requests.request("GET", url)

# The session cookie from this first request is reused in the payload and Cookie header below
cookiejar = response.cookies
for cookie in cookiejar:
    print(cookie.name, cookie.value)

url = "https://southhams.fccenvironment.co.uk/ajaxprocessor/getcollectiondetails"
payload = 'fcc_session_token={}&uprn=100040282539'.format(cookie.value)
headers = {
    'X-Requested-With': 'XMLHttpRequest',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Cookie': 'fcc_session_cookie={}'.format(cookie.value)
}

response = requests.request("POST", url, headers=headers, data=payload)
print(response.status_code)
【Comments】:
Can you provide the part that returns that code (the actual requests part)?
@chitown88 Added to the bottom of the question, thanks.
【Answer 1】:
You can fetch the JSON directly and then access the HTML value from it. Once you have that, parse the HTML with BeautifulSoup and print the text from the tags it is found in:
import requests
from bs4 import BeautifulSoup

url = "https://southhams.fccenvironment.co.uk/mycollections"
response = requests.get(url)

cookiejar = response.cookies
for cookie in cookiejar:
    print(cookie.name, cookie.value)

url = "https://southhams.fccenvironment.co.uk/ajaxprocessor/getcollectiondetails"
payload = 'fcc_session_token={}&uprn=100040282539'.format(cookie.value)
headers = {
    'X-Requested-With': 'XMLHttpRequest',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Cookie': 'fcc_session_cookie={}'.format(cookie.value)
}

jsonData = requests.post(url, headers=headers, data=payload).json()

# 'tile' is a list of one-element lists, each holding an HTML fragment for one service
data = jsonData['binCollections']['tile']
for each in data:
    soup = BeautifulSoup(each[0], 'html.parser')
    collection = soup.find('div', {'class': 'collectionDiv'}).find('h3').text.strip()
    date = soup.find_all('b')[-1].text.strip()  # last <b> holds the next scheduled collection date
    print(collection, date)
Output:
Organic Collection Service (Brown Organic Bin) Friday, 29 May 2020
Recycling Collection Service (Recycling Sacks) Friday, 29 May 2020
Refuse Collection Service (Grey Refuse Bin) Thursday, 04 June 2020
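Since the question also asks whether lxml would be simpler, a roughly equivalent loop using lxml.html might look like the sketch below (assuming the same jsonData as above):

from lxml import html

# Same extraction as the BeautifulSoup loop, using lxml + XPath
for each in jsonData['binCollections']['tile']:
    tree = html.fromstring(each[0])  # each[0] is one collectionDiv HTML fragment
    collection = tree.xpath("//div[@class='collectionDiv']//h3/text()")[0].strip()
    date = tree.xpath("//b/text()")[-1].strip()  # last <b> holds the next collection date
    print(collection, date)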
【Comments】:
【Answer 2】: The HTML document returned by this particular site is not well-formed. I still managed to make it work (inefficiently, at a scale of around 1000 tags), so it can be improved.
# With the malformed markup, the later tags end up nested inside each <h3>
headers = soup.find_all('h3')
# Heading text up to the first stray '<' gives the service name
names = [tag.text[:tag.text.find('<')] for tag in headers]
# The third <b> under each heading holds the next scheduled collection date
dates = [tag.find_all('b')[2].text[:tag.find_all('b')[2].text.find('<')] for tag in headers]
print(names)
print(dates)
#Output
['Organic Collection Service (Brown Organic Bin)', 'Recycling Collection Service (Recycling Sacks)', 'Refuse Collection Service (Grey Refuse Bin)']
['Friday, 29 May 2020', 'Friday, 29 May 2020', 'Thursday, 04 June 2020']
【Comments】:
Where does soup come from?
@SaschaM78 It comes from parsing the HTML with a library called BeautifulSoup. Not sure why he didn't include it in the solution.
The OP is already using it, so I just modified his "Requests code:" snippet.
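For reference, the missing step being discussed would presumably look something like the sketch below (assuming the raw text of the POST response from the question is parsed directly, rather than the decoded JSON):

from bs4 import BeautifulSoup

# Hypothetical reconstruction: parse the raw response body (JSON with escaped HTML inside).
# Parsed this way the markup is not well-formed, so later tags end up nested under each <h3>,
# which appears to be what the h3 / find_all('b') lookups above rely on.
soup = BeautifulSoup(response.text, 'html.parser')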