Selenium/BeautifulSoup - Python - Loop Through Multiple Pages

Posted: 2019-05-26 16:16:52

Question:

I've spent the better part of the day researching and testing the best way to loop through a set of products on a retailer's website.

While I successfully collected the set of products (and their attributes) on the first page, I've struggled to figure out the best way to loop through the site's pages to continue my scrape.

Per the code below, I tried using a "while" loop with Selenium to click the site's "next page" button and then continue collecting products.

The problem is that my code still never gets past page 1.

Am I making a silly mistake here? I've read 4 or 5 similar examples on this site, but none were specific enough to offer a solution.

from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get('https://www.kohls.com/catalog/mens-button-down-shirts-tops-clothing.jsp?CN=Gender:Mens+Silhouette:Button-Down%20Shirts+Category:Tops+Department:Clothing&cc=mens-TN3.0-S-buttondownshirts&kls_sbp=43160314801019132980443403449632772558&PPP=120&WS=0')

products.clear()
hyperlinks.clear()
reviewCounts.clear()
starRatings.clear()

products = []
hyperlinks = []
reviewCounts = []
starRatings = []

pageCounter = 0
maxPageCount = int(html_soup.find('a', class_ = 'totalPageNum').text)+1


html_soup = BeautifulSoup(driver.page_source, 'html.parser')
prod_containers = html_soup.find_all('li', class_ = 'products_grid')


while (pageCounter < maxPageCount):
    for product in prod_containers:
        # If the product has review count, then extract:
        if product.find('span', class_ = 'prod_ratingCount') is not None:
            # The product name
            name = product.find('div', class_ = 'prod_nameBlock')
            name = re.sub(r"\s+", " ", name.text)
            products.append(name)

            # The product hyperlink
            hyperlink = product.find('span', class_ = 'prod_ratingCount')
            hyperlink = hyperlink.a
            hyperlink = hyperlink.get('href')
            hyperlinks.append(hyperlink)

            # The product review count
            reviewCount = product.find('span', class_ = 'prod_ratingCount').a.text
            reviewCounts.append(reviewCount)

            # The product overall star ratings
            starRating = product.find('span', class_ = 'prod_ratingCount')
            starRating = starRating.a
            starRating = starRating.get('alt')
            starRatings.append(starRating) 

    driver.find_element_by_xpath('//*[@id="page-navigation-top"]/a[2]').click()
    counterProduct +=1
    print(counterProduct)

Comments:

I'll have to test and play with this later since I'm not near a computer, but the first thing I notice is that your html_soup and prod_containers are not inside the loop. You parse and then iterate over the first page, but you never do that again after page 1. Once you've iterated through a page and clicked to the next one, you need to parse the html and run find_all on products_grid again. So I would move that whole statement to before your html_soup line.

Also, I think you meant "pageCounter += 1", not "counterProduct"?

Sorry for the typo - move the while statement to before the html_soup line.

Answer 1:

You need to parse the page each time you "click" to the next one. So that parsing needs to happen inside your while loop; otherwise you'll keep iterating over the first page even after clicking to the next one, because the prod_containers object never changes.

Secondly, the way you have it, your while loop will never stop, because you set pageCounter = 0 but never increment it... it will always be less than maxPageCount.
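A driver-free sketch of the loop-termination point above (the page count here is an illustrative stand-in, not a value taken from the site):

```python
pageCounter = 0
maxPageCount = 5   # illustrative value; the real code derives it from totalPageNum

visited = []
while pageCounter < maxPageCount:
    visited.append(pageCounter + 1)   # stand-in for scraping the current page
    pageCounter += 1                  # without this increment, the loop never terminates

print(visited)
```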

I fixed both of those things in the code below and ran it; it appears to work, parsing pages 1 through 5.

from selenium import webdriver
from bs4 import BeautifulSoup
import re

driver = webdriver.Chrome()
driver.get('https://www.kohls.com/catalog/mens-button-down-shirts-tops-clothing.jsp?CN=Gender:Mens+Silhouette:Button-Down%20Shirts+Category:Tops+Department:Clothing&cc=mens-TN3.0-S-buttondownshirts&kls_sbp=43160314801019132980443403449632772558&PPP=120&WS=0')

products = []
hyperlinks = []
reviewCounts = []
starRatings = []

pageCounter = 0

html_soup = BeautifulSoup(driver.page_source, 'html.parser')
maxPageCount = int(html_soup.find('a', class_ = 'totalPageNum').text)+1

prod_containers = html_soup.find_all('li', class_ = 'products_grid')


while (pageCounter < maxPageCount):
    html_soup = BeautifulSoup(driver.page_source, 'html.parser')
    prod_containers = html_soup.find_all('li', class_ = 'products_grid')
    for product in prod_containers:
        # If the product has review count, then extract:
        if product.find('span', class_ = 'prod_ratingCount') is not None:
            # The product name
            name = product.find('div', class_ = 'prod_nameBlock')
            name = re.sub(r"\s+", " ", name.text)
            name = name.strip()
            products.append(name)

            # The product hyperlink
            hyperlink = product.find('span', class_ = 'prod_ratingCount')
            hyperlink = hyperlink.a
            hyperlink = hyperlink.get('href')
            hyperlinks.append(hyperlink)

            # The product review count
            reviewCount = product.find('span', class_ = 'prod_ratingCount').a.text
            reviewCounts.append(reviewCount)

            # The product overall star ratings
            starRating = product.find('span', class_ = 'prod_ratingCount')
            starRating = starRating.a
            starRating = starRating.get('alt')
            starRatings.append(starRating) 

    driver.find_element_by_xpath('//*[@id="page-navigation-top"]/a[2]').click()
    pageCounter +=1
    print(pageCounter)
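The pattern this answer relies on (re-parse, scrape, click, repeat) can be factored into a driver-free helper. This is only a sketch; scrape_all_pages and its callable parameters are hypothetical stand-ins for the Selenium/BeautifulSoup calls, not part of the original answer:

```python
def scrape_all_pages(fetch_html, parse_products, click_next, max_pages):
    """Fetch and parse the page on *every* iteration, so each pass sees
    the freshly loaded page instead of a stale first-page soup."""
    results = []
    page = 0
    while page < max_pages:
        html = fetch_html()                    # e.g. driver.page_source
        results.extend(parse_products(html))   # e.g. BeautifulSoup(...).find_all(...)
        click_next()                           # e.g. clicking the next-page button
        page += 1
    return results

# Usage with fake in-memory "pages" to show the control flow:
pages = iter(["<page1>", "<page2>", "<page3>"])
results = scrape_all_pages(
    fetch_html=lambda: next(pages),
    parse_products=lambda html: [html.strip("<>")],
    click_next=lambda: None,   # a no-op here; Selenium would click the button
    max_pages=3,
)
print(results)
```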

Comments:

Thanks @chitown88 - I ran the code and noticed that in some cases it skipped products in the loop. I applied some sleep logic and slightly adjusted how I define the maxpage logic, and now everything works. Thanks very much for the second pair of eyes!

Answer 2:

OK, this code won't run as-is from a standalone .py file. I'm guessing you're running it in IPython or a similar environment and have already initialized these variables and imported the libraries.

First, you need to include the regular-expressions package:

import re

Also, none of those clear() calls are necessary, since you initialize all of those lists anyway (in fact, Python will throw an error, because the lists aren't defined yet at the point where you call clear() on them).
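A minimal, self-contained illustration of that point (the list name mirrors the question's):

```python
# clear() on a name that has never been assigned raises NameError,
# which is why the question's clear() calls fail before the lists exist.
try:
    products.clear()   # 'products' is not defined yet at this point
except NameError as exc:
    error_seen = str(exc)

products = []          # initializing the list is all that's needed
products.append("example product")
print(error_seen, products)
```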

You also need to initialize counterProduct:

counterProduct = 0

Finally, you have to assign a value to html_soup before your code references it:

html_soup = BeautifulSoup(driver.page_source, 'html.parser')

Here is the corrected code, which works:

from selenium import webdriver
from bs4 import BeautifulSoup
import re

driver = webdriver.Chrome()
driver.get('https://www.kohls.com/catalog/mens-button-down-shirts-tops-clothing.jsp?CN=Gender:Mens+Silhouette:Button-Down%20Shirts+Category:Tops+Department:Clothing&cc=mens-TN3.0-S-buttondownshirts&kls_sbp=43160314801019132980443403449632772558&PPP=120&WS=0')

products = []
hyperlinks = []
reviewCounts = []
starRatings = []

pageCounter = 0
html_soup = BeautifulSoup(driver.page_source, 'html.parser')
maxPageCount = int(html_soup.find('a', class_ = 'totalPageNum').text)+1
prod_containers = html_soup.find_all('li', class_ = 'products_grid')
counterProduct = 0
while (pageCounter < maxPageCount):
    for product in prod_containers:
        # If the product has review count, then extract:
        if product.find('span', class_ = 'prod_ratingCount') is not None:
            # The product name
            name = product.find('div', class_ = 'prod_nameBlock')
            name = re.sub(r"\s+", " ", name.text)
            products.append(name)

            # The product hyperlink
            hyperlink = product.find('span', class_ = 'prod_ratingCount')
            hyperlink = hyperlink.a
            hyperlink = hyperlink.get('href')
            hyperlinks.append(hyperlink)

            # The product review count
            reviewCount = product.find('span', class_ = 'prod_ratingCount').a.text
            reviewCounts.append(reviewCount)

            # The product overall star ratings
            starRating = product.find('span', class_ = 'prod_ratingCount')
            starRating = starRating.a
            starRating = starRating.get('alt')
            starRatings.append(starRating) 

    driver.find_element_by_xpath('//*[@id="page-navigation-top"]/a[2]').click()
    counterProduct +=1
    print(counterProduct)

Comments:
