How to scrape product information from a page using Beautiful Soup when HTML tables are involved [closed]

Posted 2023-03-05


【Original title】: How to scrape the product information from the page using Beautiful Soup in which html table are involved [closed] 【Posted】: 2021-10-17 20:00:49 【Question】:

import requests
from bs4 import BeautifulSoup
import pandas as pd
baseurl='https://books.toscrape.com/'
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}

r =requests.get('https://books.toscrape.com/' )
soup=BeautifulSoup(r.content, 'html.parser')
productlinks=[]
Title=[]
Brand=[]
tra = soup.find_all('article',class_='product_pod')
for links in tra:
    for link in links.find_all('a',href=True)[1:]:
        comp=baseurl+link['href']
        productlinks.append(comp)

for link in productlinks:
    r =requests.get(link,headers=headers)
    soup=BeautifulSoup(r.content, 'html.parser')
    try:
        title=soup.find('h3').text
    except AttributeError:  # soup.find returned None (no <h3> on the page)
        title=' '
    Title.append(title)
    price=soup.find('p',class_="price_color").text.replace('£','').replace(',','').strip()
    Brand.append(price)

df = pd.DataFrame(
    # Brand is the list that collects the scraped prices
    {"Title": Title, "Price": Brand}
)
print(df)

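The script above works as expected, but I also want to scrape the details of each product, such as the UPC and the product type. For example, for a single product page such as https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html, I want to scrape the UPC, product type, and so on. All of this information is in the Product Information table.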
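One possible way to read such a table is sketched below. It is a minimal example rather than a confirmed solution: it assumes the product information table carries the class `table-striped` and that each row holds one `<th>` label and one `<td>` value. The function name `parse_product_info` and the inline sample HTML (including its UPC value) are made up for illustration.

```python
from bs4 import BeautifulSoup


def parse_product_info(html):
    """Return the label/value rows of the product information table as a dict."""
    soup = BeautifulSoup(html, "html.parser")
    info = {}
    # Assumption: the table has the class "table-striped".
    table = soup.find("table", class_="table-striped")
    if table is None:
        return info
    for row in table.find_all("tr"):
        label = row.find("th")
        value = row.find("td")
        # Skip rows that do not follow the <th>/<td> layout.
        if label is not None and value is not None:
            info[label.get_text(strip=True)] = value.get_text(strip=True)
    return info


# Made-up sample markup so the sketch runs without network access.
SAMPLE_HTML = """
<table class="table table-striped">
  <tr><th>UPC</th><td>example-upc-123</td></tr>
  <tr><th>Product Type</th><td>Books</td></tr>
</table>
"""

product_info = parse_product_info(SAMPLE_HTML)
print(product_info)
```

For a live page, the same function could be called with `requests.get(link, headers=headers).text` inside the loop over `productlinks`, appending each returned dict to a list that is then passed to `pd.DataFrame`.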

【Comments】:

I don't see any code that attempts to operate on multiple pages. What have you tried so far in that regard? Please fix your code. 【Answer 1】:

You can use the start= parameter in the URL to get the next page:

import requests
from bs4 import BeautifulSoup

for page in range(0, 10):  # <-- increase number of pages here
    r = requests.get(
        # {} is filled with the result offset: 0, 10, 20, ...
        "https://pk.indeed.com/jobs?q=&l=Lahore&start={}".format(page * 10)
    )
    soup = BeautifulSoup(r.content, "html.parser")
    title = soup.find_all("h2", class_="jobTitle")

    for i in title:
        print(i.text)

Prints:

Data Entry Work Online
newAdmin Assistant
newNCG Agent
Data Entry Operator
newResearch Associate Electrical
Administrative Assistant (Executive Assistant)
Admin Assistant Digitally
newIT Officer (Remote Work)
OFFICE ASSISTANT
Cash Officer - Lahore Region
newDeputy Manager Finance
Admin Assistant
Lab Assistant
newProduct Portfolio & Customer Service Specialist
Front Desk Officer
newRelationship Manager, Recovery
MANAGEMENT TRAINEE PROGRAM
Email Support Executive (International)
Data Entry Operator
Admin officer

...and so on.
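The `start=` offsets used in the answer above can also be generated separately from the requests, which makes the pagination easy to check before any page is fetched. This is only a sketch: the helper name `build_page_urls` is hypothetical, and the page size of 10 is an assumption taken from the `page * 10` step in the answer.

```python
from urllib.parse import urlencode


def build_page_urls(base_url, params, pages, page_size=10):
    """Build one URL per result page by stepping the start= offset."""
    urls = []
    for page in range(pages):
        # Copy the fixed query parameters and add the offset for this page.
        query = dict(params, start=page * page_size)
        urls.append(base_url + "?" + urlencode(query))
    return urls


urls = build_page_urls("https://pk.indeed.com/jobs", {"q": "", "l": "Lahore"}, pages=3)
for url in urls:
    print(url)
```

Each URL in the returned list can then be passed to `requests.get` in place of the formatted string.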

