How to extract data in columns from a page using soup
Posted: 2020-01-03 08:46:27
Question:
I am trying to capture the data shown in the bullet points on this page.
Link: https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/
I need to extract the data using XPath.
Data to extract:
4 Door Sedan
4 Cylinder, 1.8 Litre
Constantly Variable Transmission, Front Wheel Drive
Petrol - Unleaded ULP
6.4 L/100km
I tried this:
import requests
from lxml import html

cars = []
urls = ['https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/']
for url in urls:
    car_data = {}
    headers = {'User-Agent': 'Mozilla/5.0'}
    page = requests.get(url, headers=headers)
    tree = html.fromstring(page.content)
    # Absolute XPath copied from the browser; it breaks if the page layout changes.
    if tree.xpath('/html/body/div[1]/div[2]/div/div[1]/div[1]/div[4]/div/div'):
        # Note: this stores the matched lxml element itself, not its text.
        car_data["namings"] = tree.xpath('/html/body/div[1]/div[2]/div/div[1]/div[1]/div[4]/div/div')[0]
Answer 1:
You already import BeautifulSoup, so why not use a CSS class selector?
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/', headers={'User-Agent': 'Mozilla/5.0'})
soup = bs(r.content, 'lxml')
# Collect the stripped text of every element matching the class selector.
info = [i.text.strip() for i in soup.select('.dgi-')]
You can also print the values one by one:
for i in soup.select('.dgi-'):
    print(i.text.strip())
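Comments:
What if I need to split the output into 5 parts? For example door = 4 Door Sedan, body = 4 Cylinder, 1.8 Litre.
The output is a list. I added an edit so you can see how to print the values without a list comprehension.
How do I assign each element of the list to a variable? I tried doing this:
    info = [i.text.strip() for i in soup.select('.dgi-')]
    car_data['0']=info.split(" ")[0]
    car_data['1']=info.split(" ")[1]
    car_data['2']=info.split(" ")[2]
    car_data['3']=info.split(" ")[3]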
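On the last comment: info is a list, and lists have no split() method, so info.split(" ") raises an AttributeError. One possible way to give each value a name is to pair the list with a list of labels using zip. The sketch below is only an illustration: the label names are hypothetical (they do not come from the page), and the values are hard-coded from the expected output in the question so that it runs without a network request.

```python
# Values in page order, copied from the expected output in the question.
info = [
    "4 Door Sedan",
    "4 Cylinder, 1.8 Litre",
    "Constantly Variable Transmission, Front Wheel Drive",
    "Petrol - Unleaded ULP",
    "6.4 L/100km",
]

# Hypothetical field names (an assumption); choose labels that suit your data.
labels = ["body", "engine", "transmission", "fuel", "consumption"]

# zip pairs each label with the value at the same position;
# dict turns those pairs into a label -> value mapping.
car_data = dict(zip(labels, info))

print(car_data["body"])
print(car_data["fuel"])
```

This relies on the values always arriving in the same order. If the page omits a field, zip stops at the shorter list and the labels no longer line up, so checking len(info) first may be worthwhile.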
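Answer 2:
find_all() - returns a collection of all matching elements.
strip() - a built-in Python string method that removes all leading and trailing whitespace from a string.
For example: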
import requests
from bs4 import BeautifulSoup

cars = []
urls = ['https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/']
for url in urls:
    car_data = []
    headers = {'User-Agent': 'Mozilla/5.0'}
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, 'lxml')
    # Walk down the nested containers to the <dd> elements that hold the specs.
    car_obj = (soup.find("div", {'class': 'r-center-pane'})
                   .find("div", {'class': 'micro-spec'})
                   .find("div", {'class': 'columns'})
                   .find_all("dd"))
    for x in car_obj:
        text = x.text.strip()
        # Skip entries that are empty after stripping whitespace.
        if text != "":
            car_data.append(text)
    cars.append(car_data)
print(cars)
Output:
[['4 Door Sedan', '4 Cylinder, 1.8 Litre', 'Constantly Variable Transmission, Front Wheel Drive', 'Petrol - Unleaded ULP', '6.4 L/100km']]
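Since the question asked for XPath specifically, the same lookup can probably be written with lxml using a relative XPath keyed on class names, rather than an absolute /html/body/... path. The following is a rough sketch, not a tested solution for the live page: the SAMPLE markup is a made-up stand-in that only borrows the class names used in this answer, and the real page may be structured differently.

```python
from lxml import html

# Stand-in markup (an assumption): the class names mirror those used in
# Answer 2, but the real page's structure has not been verified here.
SAMPLE = """
<div class="r-center-pane">
  <div class="micro-spec">
    <div class="columns">
      <dl><dt>Body</dt><dd> 4 Door Sedan </dd></dl>
      <dl><dt>Engine</dt><dd>4 Cylinder, 1.8 Litre</dd></dl>
      <dl><dt>Note</dt><dd>   </dd></dl>
    </div>
  </div>
</div>
"""

tree = html.fromstring(SAMPLE)

# Relative XPath: find every <dd> under the "columns" div inside "micro-spec".
nodes = tree.xpath('//div[@class="micro-spec"]//div[@class="columns"]//dd')

# text_content() gathers the text of each node; strip() trims whitespace.
specs = [n.text_content().strip() for n in nodes]
# Drop entries that are empty after stripping, as the loop in Answer 2 does.
specs = [s for s in specs if s]

print(specs)
```

Note that @class="columns" only matches when the class attribute is exactly that string. If the element carries several classes, a contains(@class, "columns") test is a common workaround.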