Navigate with Selenium and scrape with BeautifulSoup in Python


Posted: 2019-08-07 09:47:55

[Question]:

Okay, here is what I am trying to achieve:

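    1. Call a URL with a dynamically filtered list of search results.

    2. Click on the first search result (5 per page).

    3. Scrape the title, paragraphs and images and store them as a JSON object in a separate file, e.g.

       {"Title": "Title element of the individual entry", "Content": "Paragraphs and images of the individual entry, in DOM order"}

    4. Navigate back to the search results overview page and repeat steps 2 - 3.

    5. After 5/5 results have been scraped, go to the next page (click the pagination link).

    6. Repeat steps 2 - 5 until no entries are left.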
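Step 3 above (one JSON file per entry) could look roughly like the following. This is only a minimal sketch: `safe_filename`, `save_entry` and the `entries` output directory are hypothetical names that do not appear in the question, and the file name is derived from the title on the assumption that titles are unique enough for that.

```python
import json
import re
from pathlib import Path


def safe_filename(title: str) -> str:
    """Turn an article title into a conservative file name (hypothetical helper)."""
    # Collapse every run of characters other than letters, digits, '-' and '_' into '_'.
    cleaned = re.sub(r"[^A-Za-z0-9_-]+", "_", title.strip())
    return cleaned.strip("_") or "untitled"


def save_entry(title: str, content: list, out_dir: str = "entries") -> Path:
    """Write one entry as {"Title": ..., "Content": [...]} to its own JSON file."""
    target_dir = Path(out_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / f"{safe_filename(title)}.json"
    entry = {"Title": title.strip(), "Content": content}
    path.write_text(json.dumps(entry, ensure_ascii=False, indent=2), encoding="utf-8")
    return path
```

Keeping "Content" as a list preserves the DOM order of paragraphs and image URLs without any extra bookkeeping.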

To visualize once more what is intended:

What I have so far is:

#import libraries
from selenium import webdriver
from bs4 import BeautifulSoup

#URL
url = "https://URL.com"

#Create a browser session
driver = webdriver.Chrome("PATH TO chromedriver.exe")
driver.implicitly_wait(30)
driver.get(url)

#click consent btn on destination URL ( overlays rest of the content )
python_consentButton = driver.find_element_by_id('acceptAllCookies')
python_consentButton.click() #click cookie consent btn

#Selenium hands the page source to Beautiful Soup
soup_results_overview = BeautifulSoup(driver.page_source, 'lxml')


for link in soup_results_overview.findAll("a", class_="searchResults__detail"):

  #Selenium visits each Search Result Page
  searchResult = driver.find_element_by_class_name('searchResults__detail')
  searchResult.click() #click Search Result

  #Ask Selenium to go back to the search results overview page
  driver.back()

#Tell Selenium to click paginate "next" link 
#probably needs to be in a surrounding for loop?
paginate = driver.find_element_by_class_name('pagination-link-next')
paginate.click() #click paginate next

driver.quit()

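Problem: every time Selenium navigates back to the search results overview page, the list count is reset, so it clicks the first entry 5 times, navigates to the next 5 items, and stops.

This might be a textbook case for a recursive approach, but I am not sure.

Any advice on how to tackle this is appreciated.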

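[Comments]:

Can you share the URL?

Yes, why not: cst.com/solutions#size=5&TemplateName=Application+Article

@derp: Please try my solution. Hope this helps.

[Answer 1]:

You can scrape this using only requests and BeautifulSoup, without Selenium. It will be faster and consume fewer resources: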

import json
import requests
from bs4 import BeautifulSoup

# Get 1000 results
params = {"$filter": "TemplateName eq 'Application Article'", "$orderby": "ArticleDate desc", "$top": "1000",
          "$inlinecount": "allpages", }
response = requests.get("https://www.cst.com/odata/Articles", params=params).json()

# iterate 1000 results
articles = response["value"]
for article in articles:
    article_json = {}
    article_content = []

    # title of article
    article_title = article["Title"]
    # article url
    article_url = str(article["Url"]).split("|")[1]
    print(article_title)

    # request article page and parse it
    article_page = requests.get(article_url).text
    page = BeautifulSoup(article_page, "html.parser")

    # get header
    header = page.select_one("h1.head--bordered").text
    article_json["Title"] = str(header).strip()
    # get body content with images links and descriptions
    content = page.select("section.content p, section.content img, section.content span.imageDescription, "
                          "section.content  em")
    # collect content to json format
    for x in content:
        if x.name == "img":
            article_content.append("https://cst.com/solutions/article/" + x.attrs["src"])
        else:
            article_content.append(x.text)

    article_json["Content"] = article_content

    # write to json file
    with open(f"{article_json['Title']}.json", 'w') as to_json_file:
        to_json_file.write(json.dumps(article_json))

print("the end")
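To see what the request above actually asks the server for, the query string can be built by hand. The sketch below uses only the standard library and reuses the endpoint and parameters from the answer; `build_query_url` is a hypothetical helper, and the resulting URL is only approximately what `requests` sends, since `requests` does its own encoding.

```python
from urllib.parse import parse_qs, urlencode

BASE_URL = "https://www.cst.com/odata/Articles"  # endpoint taken from the answer above


def build_query_url(top: int = 1000) -> str:
    """Return the OData query URL for the newest `top` application articles."""
    params = {
        "$filter": "TemplateName eq 'Application Article'",
        "$orderby": "ArticleDate desc",
        "$top": str(top),
        "$inlinecount": "allpages",
    }
    # urlencode percent-encodes '$' and quotes and turns spaces into '+'.
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_query_url(top=5))
```

Lowering `top` while experimenting keeps the number of follow-up article requests small.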

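[Comments]:

The answer above is a clear, lean and structured approach that covers everything originally asked for. OData queryable objects are fairly new to me and turned out to be very useful in this case.

[Answer 2]:

I have a solution for you. Get the href value of each link and then call driver.get(url).

Instead of this: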

for link in soup_results_overview.findAll("a", class_="searchResults__detail"):

  #Selenium visits each Search Result Page
  searchResult = driver.find_element_by_class_name('searchResults__detail')
  searchResult.click() #click Search Result

  #Ask Selenium to go back to the search results overview page
  driver.back()

Try this:

for link in soup_results_overview.findAll("a", class_="searchResults__detail"):
    print(link['href'])
    driver.get(link['href'])
    driver.back()

Here I printed the URLs before navigating:

https://cst.com/solutions/article/sar-spherical-phantom-model
https://cst.com/solutions/article/pin-fed-four-edges-gap-coupled-microstrip-antenna-magus
https://cst.com/solutions/article/printed-self-matched-normal-mode-helix-antenna-antenna-magus
https://cst.com/solutions/article/broadband-characterization-of-launchers
https://cst.com/solutions/article/modal-analysis-of-a-dielectric-2-port-filter
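Reading the href values as plain strings before navigating avoids the reset problem, because strings do not depend on the page that is currently loaded. The URLs printed above are already absolute; if a page returned relative hrefs instead (an assumption, not something shown in this thread), they could be resolved first. `absolute_links` is a hypothetical helper for that:

```python
from urllib.parse import urljoin


def absolute_links(hrefs, base_url):
    """Resolve hrefs against base_url, dropping empty values and duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:
            continue  # skip anchors without an href
        url = urljoin(base_url, href)
        if url not in seen:
            seen.add(url)
            result.append(url)  # keep first-seen order
    return result
```

The returned list can then be passed to driver.get(...) one URL at a time, with no need for driver.back() in between.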

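[Comments]:

[Answer 3]:

This solution navigates to each link, scrapes the title and paragraphs, stores the image URLs, and downloads all images to the local machine as .png files: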

from bs4 import BeautifulSoup as soup
import requests, re
from selenium import webdriver
def scrape_page(_d, _link):
   _head, _paras = _d.find('h1', {'class':'head--bordered'}).text, [i.text for i in _d.find_all('p')]
   images = [i.img['src'] for i in _d.find_all('a', {'class':'fancybox'})]
   for img in images:
      _result, _url = requests.get(f'{_link}{img}').content, re.findall(r"\w+\.ashx$", img)
      if _url:
        with open('electroresults/{}.png'.format(_url[0][:-5]), 'wb') as f:
          f.write(_result)    
   return _head, _paras, images   


d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://www.cst.com/solutions#size=5&TemplateName=Application+Article')
results, page, _previous = [], 1, None
while True:
  _articles = [i.get_attribute('href') for i in d.find_elements_by_class_name('searchResults__detail')]
  page_results = []
  previous = d.current_url
  for article in _articles:
    d.get(article)
    try:
      d.find_elements_by_class_name('interaction')[0].click()
    except:
      pass
    page_results.append(dict(zip(['title', 'paragraphs', 'imgs'], scrape_page(soup(d.page_source, 'html.parser'), d.current_url))))
    results.append(page_results)
  d.get(previous)
  _next = d.find_elements_by_class_name('pagination-link-next')
  if _next:
    _next[0].click()
  else:
    break

Output (only the first few articles of the first page, due to SO's character limit):

['title': '\n        Predicting SAR Behavior using Spherical Phantom Models\n    ', 'paragraphs': ['', '\nAntenna Magus is a software tool to help accelerate the antenna design and modelling process. It increases efficiency by helping the engineer to make a more informed choice of antenna element, providing a good starting design.\n', '', '', '\n                        IdEM is a user friendly tool for the generation of macromodels of linear lumped multi-port structures (e.g., via fields, connectors, packages, discontinuities, etc.), known from their input-output port responses. The raw characterization of the structure can come from measurement or simulation, either in frequency domain or in time domain.\n                    ', '', '', '\n                        FEST3D is a software tool capable of analysing complex passive microwave components based on waveguide technology (including multiplexers, couplers and filters) in very short computational times with high accuracy. This suite offers all the capabilities needed for the design of passive components such as optimization and tolerance analysis. Moreover, FEST3D advanced synthesis tools allow designing bandpass, dual-mode and lowpass filters from user specifications.\n                    ', '', '', '\n                        SPARK3D is a unique simulation tool for determining the RF breakdown power level in a wide variety of passive devices, including those based on cavities, waveguides, microstrip and antennas. Field results from CST STUDIO SUITE® simulations can be imported directly into SPARK3D to analyse vacuum breakdown (multipactor) and gas discharge. 
From this, SPARK3D calculates the maximum power that the device can handle without causing discharge effects.\n                    ', '', '', '\nEasy-to-use matching circuit optimization and antenna analysis software\n                        Optenni Lab is a professional software tool with innovative analysis features to increase the productivity of engineers requiring matching circuits. It can, e.g., speed up the antenna design process and provide antennas with optimal total performance. Optenni Lab offers fast fully-automatic matching circuit optimization tools, including automatic generation of multiple optimal topologies, estimation of the obtainable bandwidth of antennas and calculation of the worst-case isolation in multi-antenna systems.\n                    ', '', '', '\n                        The ability to visualize electromagnetic fields intuitively in 3D and also the possibility to demonstrate in a straightforward way the effect of parameter changes are obvious benefits in teaching. To support learning, teaching and research at academic institutions, CST offers four types of licenses, namely the free CST STUDIO SUITE®Student Edition, a Clas-s-room license, an Educational license and an Extended license. \n                    ', '', '', '\n                        The CST STUDIO SUITE® Student Edition has been developed with the aim of introducing you to the world of electromagnetic simulation, making Maxwell’s equations easier to understand than ever.\n                    ', '', '', '\n                        Below you will find several examples which were selected from some commonly used textbooks. 
Each example contains a short description of the theory, detailed information on how to construct the model, a video showing how to construct the model, and the fully constructed model ready for you to download.\n                    ', '', '', '\n                        In acknowledgement of the importance of university research and the impact of groundbreaking publications on the reputation of both author and tool used for the research, CST announces the endowment of a University Publication Award.\n                    ', '', '', "\n                        Regular training courses are held in CST's offices in Asia, Europe, and North America. Please check the local websites for detail of trainings in China, Korea and Japan. Advance registration is normally required.\n                    ", '', '', '\nCST exhibits at events around the globe. See a list of exhibitions CST is attending where you can speak to our sales and support staff and learn more about our products and their applications.\n', '', '', '\nThroughout the year, CST simulation experts present eSeminars on the applications, features and usage of our software. You can also view past eSeminars by searching our archive and filtering for the markets or industries that interest you most.\n\n', '', '', '\n                        CST hosts workshops in multiple languages and in countries around the world. 
Workshops provide an opportunity to learn about specific applications and refresh your skills with experienced CST support staff.\n                    ', '', '', '\n                        The CST user conference offers an informal and enlightening environment where developers and researchers using CST STUDIO SUITE® tools can exchange ideas and talk with CST staff about future developments.\n                    ', '', 'facebooklinkedinswymtwitteryoutuberss', 'Events', 'Due to the fact that measurements in true biological heads typically cannot be carried out, SAR norms for mobile phones or EMI problems are commonly defined in terms of standardized phantom models. In the easiest case, only spherical structures are considered. To predict the SAR behavior of a new product already during the design stage, it is desirable to include the phantom head in the EM simulations. ', 'The following examples\xa0investigate two spherical phantom models, a basic one that only contains of tissue material inside a glass sphere and a more complex one that has two\xa0additional layers of bone and tissue.\xa0\xa0A dipole antenna is used for the excitation and\xa0is displayed as a yellow line in the following picture.', 'The SAR distribution is simulated at 835 MHz and visualized in the figure below. A comparison of the SAR values over a radial line shows good agreement with the measurement of the same structure.', 'For the following simulation a more complex model including a simplified skull is used.', 'A comparison of the SAR values at 1.95 GHz on an off-axis path shows\xa0a significant difference between the basic homogeneous model and the more complex one. 
Since the values are higher, the simplified model may not be sufficient in all cases.', ' Go to Article', ' Go to Article', ' Go to Article', ' Go to Article', ' Go to Article', '\n        Please read our\n        Privacy Statement\xa0|\xa0\n        Impressum \xa0|\xa0\n        Sitemap \xa0|\xa0\n         © 2019 Dassault Systemes Deutschland GmbH. All rights reserved.\n    ', 'Your session has expired. Redirecting you to the login page...', '\n                We use cookie to operate this website, improve its usability, personalize your experience, and track visits. By continuing to use this site, you are consenting to use of cookies. You have the possibility to manage the parameters and choose whether to accept certain cookies while on the site. For more information, please read our updated privacy policy\n', 'When you browse our website, cookies are enabled by default and data may be read or stored locally on your device. You can set your preferences below:', 'These cookies enable additional functionality like saving preferences, allowing social interactions and analyzing usage for site optimization.', 'These cookies enable us and third parties to serve ads that are relevant to your interests.'], 'imgs': ['~/media/B692C95635564BBDA18AFE7C35D3CC7E.ashx', '~/media/DC7423B9D92542CF8254365D9C83C9E7.ashx', '~/media/54E5C0BE872B411EBDC1698E19894670.ashx', '~/media/114789FC714042A89019C5E41E64ADEE.ashx', '~/media/B9AF3151613C44D2BFE1B5B9B6504885.ashx'], 'title': '\n        Pin-fed Four Edges Gap Coupled Microstrip Antenna | Antenna Magus\n    ', 'paragraphs': ['', '\nAntenna Magus is a software tool to help accelerate the antenna design and modelling process. 
It increases efficiency by helping the engineer to make a more informed choice of antenna element, providing a good starting design.\n', '', '', '\n                        IdEM is a user friendly tool for the generation of macromodels of linear lumped multi-port structures (e.g., via fields, connectors, packages, discontinuities, etc.), known from their input-output port responses. The raw characterization of the structure can come from measurement or simulation, either in frequency domain or in time domain.\n                    ', '', '', '\n                        FEST3D is a software tool capable of analysing complex passive microwave components based on waveguide technology (including multiplexers, couplers and filters) in very short computational times with high accuracy. This suite offers all the capabilities needed for the design of passive components such as optimization and tolerance analysis. Moreover, FEST3D advanced synthesis tools allow designing bandpass, dual-mode and lowpass filters from user specifications.\n                    ', '', '', '\n                        SPARK3D is a unique simulation tool for determining the RF breakdown power level in a wide variety of passive devices, including those based on cavities, waveguides, microstrip and antennas. Field results from CST STUDIO SUITE® simulations can be imported directly into SPARK3D to analyse vacuum breakdown (multipactor) and gas discharge. From this, SPARK3D calculates the maximum power that the device can handle without causing discharge effects.\n                    ', '', '', '\nEasy-to-use matching circuit optimization and antenna analysis software\n                        Optenni Lab is a professional software tool with innovative analysis features to increase the productivity of engineers requiring matching circuits. It can, e.g., speed up the antenna design process and provide antennas with optimal total performance. 
Optenni Lab offers fast fully-automatic matching circuit optimization tools, including automatic generation of multiple optimal topologies, estimation of the obtainable bandwidth of antennas and calculation of the worst-case isolation in multi-antenna systems.\n                    ', '', '', '\n                        The ability to visualize electromagnetic fields intuitively in 3D and also the possibility to demonstrate in a straightforward way the effect of parameter changes are obvious benefits in teaching. To support learning, teaching and research at academic institutions, CST offers four types of licenses, namely the free CST STUDIO SUITE®Student Edition, a Clas-s-room license, an Educational license and an Extended license. \n                    ', '', '', '\n                        The CST STUDIO SUITE® Student Edition has been developed with the aim of introducing you to the world of electromagnetic simulation, making Maxwell’s equations easier to understand than ever.\n                    ', '', '', '\n                        Below you will find several examples which were selected from some commonly used textbooks. Each example contains a short description of the theory, detailed information on how to construct the model, a video showing how to construct the model, and the fully constructed model ready for you to download.\n                    ', '', '', '\n                        In acknowledgement of the importance of university research and the impact of groundbreaking publications on the reputation of both author and tool used for the research, CST announces the endowment of a University Publication Award.\n                    ', '', '', "\n                        Regular training courses are held in CST's offices in Asia, Europe, and North America. Please check the local websites for detail of trainings in China, Korea and Japan. Advance registration is normally required.\n                    ", '', '', '\nCST exhibits at events around the globe. 
See a list of exhibitions CST is attending where you can speak to our sales and support staff and learn more about our products and their applications.\n', '', '', '\nThroughout the year, CST simulation experts present eSeminars on the applications, features and usage of our software. You can also view past eSeminars by searching our archive and filtering for the markets or industries that interest you most.\n\n', '', '', '\n                        CST hosts workshops in multiple languages and in countries around the world. Workshops provide an opportunity to learn about specific applications and refresh your skills with experienced CST support staff.\n                    ', '', '', '\n                        The CST user conference offers an informal and enlightening environment where developers and researchers using CST STUDIO SUITE® tools can exchange ideas and talk with CST staff about future developments.\n                    ', '', 'facebooklinkedinswymtwitteryoutuberss', 'Events', 'Although microstrip antennas are very popular in the microwave frequency range because of their simplicity and compatibility with circuit board technology, their limited bandwidth often restricts their usefulness.', 'Various methods have been suggested to overcome this limitation – including the use of gap- or direct-coupled parasitic patches. In the FEGCOMA, these parasitic patches are placed alongside all four edges of the driven patch element. The introduction of parasitic patches of slightly different resonant lengths yields further resonances improving the bandwidth and gain of the standard patch. In this case, the structure is optimized to obtain a well-defined, designable bandwidth with near-optimally spaced zeros. 
Typical gain values of 10 dBi may be expected, with a designable fractional impedance bandwidth between 12 % and 30 %....', '', ' Go to Article', ' Go to Article', ' Go to Article', ' Go to Article', ' Go to Article', '\n        Please read our\n        Privacy Statement\xa0|\xa0\n        Impressum \xa0|\xa0\n        Sitemap \xa0|\xa0\n         © 2019 Dassault Systemes Deutschland GmbH. All rights reserved.\n    ', 'Your session has expired. Redirecting you to the login page...', '\n                We use cookie to operate this website, improve its usability, personalize your experience, and track visits. By continuing to use this site, you are consenting to use of cookies. You have the possibility to manage the parameters and choose whether to accept certain cookies while on the site. For more information, please read our updated privacy policy\n', 'When you browse our website, cookies are enabled by default and data may be read or stored locally on your device. You can set your preferences below:', 'These cookies enable additional functionality like saving preferences, allowing social interactions and analyzing usage for site optimization.', 'These cookies enable us and third parties to serve ads that are relevant to your interests.'], 'imgs': ['http://www.antennamagus.com/database/antennas/341/Patch_FEGCOMA_Pin_small.png', 'http://www.antennamagus.com/images/Newsletter2019-0/FEGCOMA_3D_with_plus.png', 'http://www.antennamagus.com/images/Newsletter2019-0/FEGCOMA_s11_with_plus.png'], 'title': '\n        Printed Self-Matched Normal Mode Helix Antenna | Antenna Magus\n    ', 'paragraphs': ['', '\nAntenna Magus is a software tool to help accelerate the antenna design and modelling process. 
It increases efficiency by helping the engineer to make a more informed choice of antenna element, providing a good starting design.\n', '', '', '\n                        IdEM is a user friendly tool for the generation of macromodels of linear lumped multi-port structures (e.g., via fields, connectors, packages, discontinuities, etc.), known from their input-output port responses. The raw characterization of the structure can come from measurement or simulation, either in frequency domain or in time domain.\n                    ', '', '', '\n                        FEST3D is a software tool capable of analysing complex passive microwave components based on waveguide technology (including multiplexers, couplers and filters) in very short computational times with high accuracy. This suite offers all the capabilities needed for the design of passive components such as optimization and tolerance analysis. Moreover, FEST3D advanced synthesis tools allow designing bandpass, dual-mode and lowpass filters from user specifications.\n                    ', '', '', '\n                        SPARK3D is a unique simulation tool for determining the RF breakdown power level in a wide variety of passive devices, including those based on cavities, waveguides, microstrip and antennas. Field results from CST STUDIO SUITE® simulations can be imported directly into SPARK3D to analyse vacuum breakdown (multipactor) and gas discharge. From this, SPARK3D calculates the maximum power that the device can handle without causing discharge effects.\n                    ', '', '', '\nEasy-to-use matching circuit optimization and antenna analysis software\n                        Optenni Lab is a professional software tool with innovative analysis features to increase the productivity of engineers requiring matching circuits. It can, e.g., speed up the antenna design process and provide antennas with optimal total performance. 
Optenni Lab offers fast fully-automatic matching circuit optimization tools, including automatic generation of multiple optimal topologies, estimation of the obtainable bandwidth of antennas and calculation of the worst-case isolation in multi-antenna systems.\n                    ', '', '', '\n                        The ability to visualize electromagnetic fields intuitively in 3D and also the possibility to demonstrate in a straightforward way the effect of parameter changes are obvious benefits in teaching. To support learning, teaching and research at academic institutions, CST offers four types of licenses, namely the free CST STUDIO SUITE®Student Edition, a Clas-s-room license, an Educational license and an Extended license. \n                    ', '', '', '\n                        The CST STUDIO SUITE® Student Edition has been developed with the aim of introducing you to the world of electromagnetic simulation, making Maxwell’s equations easier to understand than ever.\n                    ', '', '', '\n                        Below you will find several examples which were selected from some commonly used textbooks. Each example contains a short description of the theory, detailed information on how to construct the model, a video showing how to construct the model, and the fully constructed model ready for you to download.\n                    ', '', '', '\n                        In acknowledgement of the importance of university research and the impact of groundbreaking publications on the reputation of both author and tool used for the research, CST announces the endowment of a University Publication Award.\n                    ', '', '', "\n                        Regular training courses are held in CST's offices in Asia, Europe, and North America. Please check the local websites for detail of trainings in China, Korea and Japan. Advance registration is normally required.\n                    ", '', '', '\nCST exhibits at events around the globe. 
See a list of exhibitions CST is attending where you can speak to our sales and support staff and learn more about our products and their applications.\n', '', '', '\nThroughout the year, CST simulation experts present eSeminars on the applications, features and usage of our software. You can also view past eSeminars by searching our archive and filtering for the markets or industries that interest you most.\n\n', '', '', '\n                        CST hosts workshops in multiple languages and in countries around the world. Workshops provide an opportunity to learn about specific applications and refresh your skills with experienced CST support staff.\n                    ', '', '', '\n                        The CST user conference offers an informal and enlightening environment where developers and researchers using CST STUDIO SUITE® tools can exchange ideas and talk with CST staff about future developments.\n                    ', '', 'facebooklinkedinswymtwitteryoutuberss', 'Events', 'Normal mode helix antennas (NMHA) are often used for handheld radio transceivers and mobile communications applications. The printed self-matched NMHA is naturally matched to 50 Ω, thus avoiding the typical design challenge of matching similar structures at resonance.', 'It exhibits properties similar to other NMHAs, namely: It is compact (with the total height being typically 0.14 λ), it is vertically polarized and omni-directional and has a bandwidth of approximately 3%.', 'The helical structure consists of two (inner and outer) metallic helical strips of equal width, with a central dielectric section between them.', ' Go to Article', ' Go to Article', ' Go to Article', ' Go to Article', ' Go to Article', '\n        Please read our\n        Privacy Statement\xa0|\xa0\n        Impressum \xa0|\xa0\n        Sitemap \xa0|\xa0\n         © 2019 Dassault Systemes Deutschland GmbH. All rights reserved.\n    ', 'Your session has expired. 
Redirecting you to the login page...', '\n                We use cookie to operate this website, improve its usability, personalize your experience, and track visits. By continuing to use this site, you are consenting to use of cookies. You have the possibility to manage the parameters and choose whether to accept certain cookies while on the site. For more information, please read our updated privacy policy\n', 'When you browse our website, cookies are enabled by default and data may be read or stored locally on your device. You can set your preferences below:', 'These cookies enable additional functionality like saving preferences, allowing social interactions and analyzing usage for site optimization.', 'These cookies enable us and third parties to serve ads that are relevant to your interests.'], 'imgs': ['http://www.antennamagus.com/database/antennas/342/Printed_Matched_NMHA_small.png', 'http://www.antennamagus.com/images/Newsletter2019-0/NMHA_3D_Farfield_with_plus.png', 'http://www.antennamagus.com/images/Newsletter2019-0/NMHA_2D_sketch_with_plus.png', 'http://www.antennamagus.com/images/Newsletter2019-0/NMHA_S11vsFrequency_with_plus.png']]
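A note on the imgs values in the output above: the download step in this answer only writes a file when the image source ends in .ashx, so sources such as the antennamagus.com .png links are collected but not saved. The sketch below isolates that file-name logic; `ashx_stem` is a hypothetical name, and it is meant to mirror the answer's re.findall(...)[0][:-5] expression rather than replace it.

```python
import re
from typing import Optional


def ashx_stem(src: str) -> Optional[str]:
    """Return the media id of a '...<id>.ashx' source, or None for anything else."""
    match = re.search(r"(\w+)\.ashx$", src)
    return match.group(1) if match else None


if __name__ == "__main__":
    samples = [
        "~/media/B692C95635564BBDA18AFE7C35D3CC7E.ashx",
        "http://www.antennamagus.com/database/antennas/341/Patch_FEGCOMA_Pin_small.png",
    ]
    for src in samples:
        print(src, "->", ashx_stem(src))
```

Returning None for non-matching sources makes the "skip this image" case explicit at the call site.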

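[Comments]:

[Answer 4]:

The code below sets the number of results per page to 20 and calculates the number of result pages. It clicks "next" until all pages have been visited. A condition is added to make sure each page has loaded. I only print the articles to show you that the pages differ. You can use this structure to build your desired output.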

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import math

startUrl = 'https://www.cst.com/solutions#size=20&TemplateName=Application+Article'
url = 'https://www.cst.com/solutions#size=20&TemplateName=Application+Article&page='
driver = webdriver.Chrome()
driver.get(startUrl)
driver.find_element_by_id('acceptAllCookies').click()
items = WebDriverWait(driver,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".searchResults__detail")))
resultCount = int(driver.find_element_by_css_selector('[data-bind="text: resultsCount()"]').text.replace('items were found','').strip())
resultsPerPage = 20
numPages = math.ceil(resultCount/resultsPerPage)
currentCount = resultsPerPage
header = driver.find_element_by_css_selector('.searchResults__detail h3').text
test = header

for page in range(1, numPages + 1):
    if page == 1:   
        print([item.text for item in items])
        #do something with first page
    else:   
        driver.find_element_by_css_selector('.pagination-link-next').click()
        while header == test:
            try:
                header = driver.find_element_by_css_selector('.searchResults__detail h3').text
            except:
                continue

        items = WebDriverWait(driver,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".searchResults__detail")))
        test = header
        #do something with next page
        print([item.text for item in items])
    if page == 4:  #delete later
        break #delete later
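The page arithmetic in this answer can be checked in isolation. The sketch below assumes the counter text has the form "N items were found", as implied by the replace() call above; `parse_result_count` and `number_of_pages` are hypothetical helper names, and the handling of thousands separators is a guess.

```python
import math
import re


def parse_result_count(text: str) -> int:
    """Extract the first integer from text such as '123 items were found'."""
    match = re.search(r"\d+", text.replace(",", ""))  # tolerate '1,234' (assumption)
    if match is None:
        raise ValueError(f"no result count found in {text!r}")
    return int(match.group())


def number_of_pages(result_count: int, per_page: int) -> int:
    """Number of pages needed to show result_count items, per_page at a time."""
    return math.ceil(result_count / per_page)


if __name__ == "__main__":
    count = parse_result_count("123 items were found")
    print(count, number_of_pages(count, 20))
```

Rounding up matters here: 123 results at 20 per page need 7 pages, the last one holding only 3 entries.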

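[Comments]:

[Answer 5]:

You are not using the link variable anywhere in the loop; you are just telling the driver to find the top link and click it. (When you use the singular find_element selector and there are multiple matches, Selenium only grabs the first one.) I think all you need to do is replace these lines

searchResult = driver.find_element_by_class_name('searchResults__detail')
searchResult.click()

with

link.click()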

Does that help?

OK, regarding pagination: since the "next" button disappears on the last page, you can use the following strategy:

paginate = driver.find_element_by_class_name('pagination-link-next')
while paginate.is_displayed():
    for link in soup_results_overview.findAll("a", class_="searchResults__detail"):

        #Selenium visits each Search Result Page
        searchResult.click() #click Search Result

        #Scrape the form with a function defined elsewhere
        scrape()

        #Ask Selenium to go back to the search results overview page
        driver.back()

    #Click pagination button after executing the for loop finishes on each page
    paginate.click()
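One caveat with the loop above is that element handles located before a navigation may no longer be valid afterwards. A variation that looks everything up again on each page is sketched below. It is written against any driver object exposing `get`, `current_url` and `find_elements(by, value)`; "class name" is the string value of Selenium's By.CLASS_NAME. The `scrape` callback and the `max_pages` safety limit are hypothetical additions, and waiting for the next page to finish loading is deliberately left out.

```python
CLASS_NAME = "class name"  # same string as selenium.webdriver.common.by.By.CLASS_NAME


def crawl_all_pages(driver, scrape, max_pages=100):
    """Visit every result on every page and return the number of pages processed."""
    pages_done = 0
    while pages_done < max_pages:
        overview_url = driver.current_url
        # Read the hrefs as plain strings first; strings cannot go stale.
        links = driver.find_elements(CLASS_NAME, "searchResults__detail")
        urls = [link.get_attribute("href") for link in links]
        for url in urls:
            driver.get(url)
            scrape(driver)  # caller-supplied function that extracts the article
        driver.get(overview_url)  # return to the overview page we started from
        pages_done += 1
        # Look the "next" link up again instead of reusing an old handle.
        next_links = driver.find_elements(CLASS_NAME, "pagination-link-next")
        if not next_links or not next_links[0].is_displayed():
            break
        next_links[0].click()
    return pages_done
```

With a real browser, an explicit wait after the click (as in Answer 4) would still be needed before reading the next page's links.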

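[Comments]:

That doesn't work; it returns: TypeError: 'NoneType' object is not callable

That confuses me. I'm actually on mobile right now, but I'll look at it in a real browser soon. In the meantime, I have a silly question: did you confirm that your soup_results_overview actually finds the links?

Now that I think about it, why do you need Beautiful Soup at all? It seems you could just find the links with Selenium without bothering with the whole page source. Admittedly I'm not familiar with BeautifulSoup, but it seems Selenium can do the job.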
