Scraping a site using selenium and bs4 does not work


【Title】Scraping a site using selenium and bs4 does not work 【Posted】2022-01-16 12:12:43 【Question】:

I am trying to scrape the following website:

https://cve.mitre.org/cve/data_feeds

    from bs4 import BeautifulSoup
    from selenium import webdriver

    driver = webdriver.Chrome()  # brew install chromedriver
    driver.get(self._SCRAPE_WEBSITE_URL)
    page = driver.page_source
    soup = BeautifulSoup(page, 'lxml')
    # find_all takes the attribute filter as a dict
    cve = soup.find_all("li", {"class": "timeline-TweetList-tweet customisable-border"})
    print(cve)

But my print returns an empty list.

Any ideas?

【Comments】:

What information do you want to scrape? 【Answer 1】:

The elements you are trying to access are inside an iframe. To access them, you have to switch to that iframe first. With Selenium, this can be done as follows:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # brew install chromedriver
wait = WebDriverWait(driver, 20)  # wait up to 20 seconds for each condition

driver.get(self._SCRAPE_WEBSITE_URL)
# Wait for the Twitter widget iframe to load, then switch into it
wait.until(EC.frame_to_be_available_and_switch_to_it((By.XPATH,"//iframe[@id='twitter-widget-0']")))
# Wait until at least one tweet is visible inside the iframe
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "li.timeline-TweetList-tweet.customisable-border")))
cve = driver.find_elements(By.CSS_SELECTOR, "li.timeline-TweetList-tweet.customisable-border")

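I guess this can also be done with bs4, but I am not familiar enough with bs4 to know how to switch to an iframe with it. Don't forget to switch back to the default content once you have finished working with the iframe content.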
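One way to make sure the switch back is never forgotten is to wrap the frame switch in a context manager. The following is only a minimal sketch: the helper name `inside_frame` is hypothetical (it is not part of Selenium), and it assumes `driver` is a Selenium WebDriver whose `switch_to.frame()` and `switch_to.default_content()` methods are used as in the code above.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame):
    """Switch into `frame`, then always return to the top-level document.

    `inside_frame` is a hypothetical helper, not a Selenium API.
    `frame` can be anything driver.switch_to.frame() accepts
    (a frame name or id, an index, or a WebElement).
    """
    driver.switch_to.frame(frame)
    try:
        yield driver
    finally:
        # Runs even if the body raises, so later lookups
        # start from the default content again.
        driver.switch_to.default_content()


# Usage sketch (needs a real browser session, so it is left commented out):
# with inside_frame(driver, "twitter-widget-0"):
#     cve = driver.find_elements(By.CSS_SELECTOR, "li.timeline-TweetList-tweet")
```

The `try`/`finally` means the driver returns to the default content even when a lookup inside the iframe fails.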

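【Comments】:

【Answer 2】:

Your print returns an empty list because the HTML you are looking for sits inside the second iframe on the page, and you need to switch to that iframe to get the data. With the switch in place, it works fine.

You can install the driver manager with pip install webdriver-manager and then run the code: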

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time


url = 'https://cve.mitre.org/cve/data_feeds'

# Selenium 4 takes the driver path through a Service object
cm = ChromeDriverManager().install()
driver = webdriver.Chrome(service=Service(cm))

driver.maximize_window()
time.sleep(8)
driver.get(url)
time.sleep(5)  # crude wait for the Twitter widget to load

# The tweets live in the second iframe on the page
iframe = driver.find_elements(By.TAG_NAME, 'iframe')[1]
driver.switch_to.frame(iframe)

# After the switch, page_source is the iframe's document
soup = BeautifulSoup(driver.page_source, 'html.parser')
cves = soup.find_all("li", {"class": "timeline-TweetList-tweet customisable-border"})
for cve in cves:
    tweet_text = cve.select_one('p').text
    print(tweet_text)

Result:

CVE-2021-44833 The CLI 1.0.0 for Amazon AWS OpenSearch has weak permissions for the configuration file. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44833 …

CVE-2021-41805 HashiCorp Consul Enterprise before 1.8.17, 1.9.x before 1.9.11, and 1.10.x before 1.10.4 has Incorrect Access Control. An ACL token (with the default operator:write permissions) in one namespace can be used for unintended privilege ... https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-41805 …

CVE-2021-44515 Zoho ManageEngine Desktop Central is vulnerable to authentication bypass, leading to remote code execution on the server, as exploited in the wild in December 2021. For Enterprise builds 10.1.2127.17 and earlier, upgrade to 10.1.212... https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44515 …

CVE-2021-4097 phpservermon is vulnerable to Improper Neutralization of CRLF Sequences https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-4097 …

CVE-2021-4092 yetiforcecrm is vulnerable to Cross-Site Request Forgery (CSRF) https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-4092 …

CVE-2021-41242 OpenOlat is a web-based learning management system. A path traversal vulnerability exists in OpenOlat prior to versions 15.5.12 and 16.0.5. By providing a filename that contains a relative path as a parameter in some REST methods, it... https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-41242 …

CVE-2021-26340 A malicious hypervisor in conjunction with an unprivileged attacker process inside an SEV/SEV-ES guest VM may fail to flush the Translation Lookaside Buffer (TLB) resulting in unexpected behavior inside the virtual machine (VM). https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-26340 …

CVE-2020-12890 Improper handling of pointers in the System Management Mode (SMM) handling code may allow for a privileged attacker with physical or administrative access to potentially manipulate the AMD Generic Encapsulated Software Architecture ... https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-12890 …

CVE-2021-43815 Grafana is an open-source platform for monitoring and observability. Grafana prior to versions 8.3.2 and 7.5.12 has a directory traversal for arbitrary .csv files. It only affects in... https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-43815 …

CVE-2021-4089 snipe-it is vulnerable to Improper Access Control https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-4089 …

CVE-2021-23700 All versions of package merge-deep2 are vulnerable to Prototype Pollution via the mergeDeep() function. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-23700 …

CVE-2021-23663 All versions of package sey are vulnerable to Prototype Pollution via the deepmerge() function. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-23663 …

CVE-2021-23639 The package md-to-pdf before 5.0.0 are vulnerable to Remote Code Execution (RCE) due to utilizing the library gray-matter to parse front matter content, without disabling the JS engine. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-23639 …

CVE-2021-23561 All versions of package comb are vulnerable to Prototype Pollution via the deepMerge() function. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-23561 …

CVE-2021-23463 The package com.h2database:h2 from 0 and before 2.0.202 are vulnerable to XML External Entity (XXE) Injection via the org.h2.jdbc.JdbcSQLXML class object, when it receives parsed str... https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-23463 …

CVE-2021-27984 In Pluck-4.7.15 admin background a remote command execution vulnerability exists when uploading files. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-27984 …
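To see offline why the original attempt printed an empty list, here is a small sketch that needs no browser. Both HTML strings are made-up stand-ins, not real captures of the site: `OUTER_HTML` imitates what `driver.page_source` holds before switching (the iframe tag is there, but its document is not), and `FRAME_HTML` imitates what it holds after switching. The function name `tweet_texts` is also hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for driver.page_source BEFORE switching frames:
# the <iframe> element is present, but its inner document is not.
OUTER_HTML = """
<html><body>
  <h1>Data feeds</h1>
  <iframe id="twitter-widget-0" src="about:blank"></iframe>
</body></html>
"""

# Made-up stand-in for driver.page_source AFTER switching into the iframe.
FRAME_HTML = """
<html><body><ol>
  <li class="timeline-TweetList-tweet customisable-border"><p>first example entry</p></li>
  <li class="customisable-border timeline-TweetList-tweet"><p>second example entry</p></li>
</ol></body></html>
"""


def tweet_texts(html):
    """Return the text of every tweet <li> found in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    # A CSS selector matches both classes regardless of their order.
    items = soup.select("li.timeline-TweetList-tweet.customisable-border")
    return [li.get_text(strip=True) for li in items]


outer_texts = tweet_texts(OUTER_HTML)  # nothing to find in the outer page
frame_texts = tweet_texts(FRAME_HTML)  # the tweets are only in the frame
print(outer_texts)
print(frame_texts)
```

BeautifulSoup only parses the string it is given, so the frame switch has to happen in Selenium before `page_source` is read.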

【Comments】:
