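Speeding up webscraping in selenium
Posted: 2020-04-11 20:31:08
Question: I wrote a small application that fetches reserved instance prices from the AWS site and prints each instance's name and price (I only want the convertible 3-year term). The application works, but it is very slow, probably because the list allElements contains 1925 elements and I then iterate over all of them. I want to filter the data in code (say, only Linux instances whose names start with c5). How can I do this faster? Is there a way to filter in place, so that not everything on the page ends up in the allElements list? Thanks in advance for your help!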
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import time

caps = DesiredCapabilities().FIREFOX
#caps["pageLoadStrategy"] = "normal" # complete
#caps["pageLoadStrategy"] = "eager" # interactive
caps["pageLoadStrategy"] = "none"
browser = webdriver.Firefox(desired_capabilities=caps)
browser.get('https://aws.amazon.com/ec2/pricing/reserved-instances/pricing/')
delay = 3
time.sleep(10)
#browser.find_element_by_link_text('Windows').click()
try:
    myElem = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.CLASS_NAME, 'aws-plc-content')))
    print("Page is ready!")
except TimeoutException:
    print("Loading took too much time!")
time.sleep(2)
allElements = browser.find_elements_by_class_name("aws-pricing-table-wrapper")
for el in allElements:
    lista = el.text.split("\n")
    indeks = lista.index("CONVERTIBLE 3-YEAR TERM")
    prices = lista[indeks+2]
    if lista[0].startswith('c5'):
        print(lista[0])
        print(prices.split()[4])
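Answer 1: I tried this with the Chrome driver; it should give the same result in Firefox.
You need to do a few things to achieve this:
- Loop indefinitely, scrolling the page first on each pass.
- Use WebDriverWait to find all the matching elements and append them to a list, checking that the list contains no duplicates.
- Break out of the loop once the bottom of the page is reached.
- Use the XPath expressions in the code below to get the output you are after.
- Use element.get_attribute("textContent") to read the values; with element.text you may get some blank strings.
Try the code below.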
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys  # required for Keys.END below
from selenium.common.exceptions import TimeoutException
import time

browser = webdriver.Chrome()
browser.get('https://aws.amazon.com/ec2/pricing/reserved-instances/pricing/')
delay = 10
try:
    myElem = WebDriverWait(browser, delay).until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.aws-plc-content')))
    print("Page is ready!")
except TimeoutException:
    print("Loading took too much time!")
last_height = browser.execute_script("return document.body.scrollHeight")
items = []
while True:
    # Scroll to the bottom so that more of the page gets loaded.
    browser.find_element_by_tag_name('body').send_keys(Keys.END)
    time.sleep(1)
    # Select only the headings that start with 'c5.'.
    allElements = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located(
        (By.XPATH, "//div[@class='aws-pricing-table-wrapper']/h2[starts-with(text(),'c5.')]")))
    print(len(allElements))
    for el in allElements:
        # Skip headings whose text was already collected.
        if el.text in items:
            continue
        items.append(el.get_attribute("textContent").strip())
        items.append(el.find_element_by_xpath("./following-sibling::table[4]//tr//th[contains(.,'Convertible 3-Year Term')]/following::tbody[1]//tr[1]//td[4]").get_attribute("textContent").strip())
    # Stop once scrolling no longer increases the page height.
    new_height = browser.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
#Print all items and their price.
print(items)
#Get the length of the list #it should be 244X2
print(len(items))
Console output:
Page is ready!
9
88
['c5.large', '$0.041', 'c5.xlarge', '$0.081', 'c5.2xlarge', '$0.162', 'c5.4xlarge', '$0.324', 'c5.9xlarge', '$0.729', 'c5.12xlarge', '$0.985', 'c5.18xlarge', '$1.459', 'c5.24xlarge', '$1.970', 'c5.metal', '$1.970', 'c5.large', '$0.101', 'c5.xlarge', '$0.141', 'c5.2xlarge', '$0.292', 'c5.4xlarge', '$0.454', 'c5.9xlarge', '$0.859', 'c5.12xlarge', '$1.115', 'c5.18xlarge', '$1.589', 'c5.24xlarge', '$2.100', 'c5.metal', '$2.100', 'c5.large', '$0.074', 'c5.xlarge', '$0.114', 'c5.2xlarge', '$0.195', 'c5.4xlarge', '$0.357', 'c5.9xlarge', '$0.762', 'c5.12xlarge', '$1.018', 'c5.18xlarge', '$1.492', 'c5.24xlarge', '$2.003', 'c5.metal', '$2.003', 'c5.large', '$0.133', 'c5.xlarge', '$0.265', 'c5.2xlarge', '$0.530', 'c5.4xlarge', '$1.060', 'c5.9xlarge', '$2.385', 'c5.12xlarge', '$3.193', 'c5.18xlarge', '$4.771', 'c5.24xlarge', '$6.386', 'c5.metal', '$6.386', 'c5.large', '$0.613', 'c5.xlarge', '$0.745', 'c5.2xlarge', '$1.490', 'c5.4xlarge', '$2.980', 'c5.9xlarge', '$6.705', 'c5.12xlarge', '$8.953', 'c5.18xlarge', '$13.411', 'c5.24xlarge', '$17.906', 'c5.metal', '$17.906', 'c5.large', '$0.200', 'c5.xlarge', '$0.333', 'c5.2xlarge', '$0.665', 'c5.4xlarge', '$1.331', 'c5.9xlarge', '$2.994', 'c5.12xlarge', '$4.004', 'c5.18xlarge', '$5.988', 'c5.24xlarge', '$8.008', 'c5.metal', '$8.008', 'c5.xlarge', '$1.765', 'c5.2xlarge', '$3.530', 'c5.4xlarge', '$7.060', 'c5.9xlarge', '$15.885', 'c5.12xlarge', '$21.193', 'c5.18xlarge', '$31.771', 'c5.24xlarge', '$42.386', 'c5.metal', '$42.386', 'c5.large', '$0.521', 'c5.xlarge', '$0.561', 'c5.2xlarge', '$1.122', 'c5.4xlarge', '$2.244', 'c5.9xlarge', '$5.049', 'c5.12xlarge', '$6.745', 'c5.18xlarge', '$10.099', 'c5.24xlarge', '$13.490', 'c5.metal', '$13.490', 'c5.large', '$0.108', 'c5.xlarge', '$0.149', 'c5.2xlarge', '$0.297', 'c5.4xlarge', '$0.595', 'c5.9xlarge', '$1.338', 'c5.12xlarge', '$1.796', 'c5.18xlarge', '$2.676', 'c5.24xlarge', '$3.592', 'c5.metal', '$3.592', 'c5.xlarge', '$1.581', 'c5.2xlarge', '$3.162', 'c5.4xlarge', '$6.324', 
'c5.9xlarge', '$14.229', 'c5.12xlarge', '$18.985', 'c5.18xlarge', '$28.459', 'c5.24xlarge', '$37.970', 'c5.metal', '$37.970']
176
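The flat items list printed above alternates instance names and prices, which makes it awkward to read. As a minimal sketch (the helper name pair_items is hypothetical and not part of the answer), the list could be regrouped into (name, price) tuples:

```python
def pair_items(items):
    """Group a flat [name, price, name, price, ...] list into (name, price) tuples."""
    if len(items) % 2 != 0:
        raise ValueError("expected an even number of entries (name/price pairs)")
    # Even indexes hold instance names, odd indexes hold the matching prices.
    return list(zip(items[0::2], items[1::2]))


# Sample taken from the start of the console output above.
sample = ['c5.large', '$0.041', 'c5.xlarge', '$0.081', 'c5.2xlarge', '$0.162']
for name, price in pair_items(sample):
    print(name, price)
```

Tuples are used here rather than a dict because the same instance name appears several times in the output with different prices, and a dict keyed by name would keep only the last one.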
Solution 2:
If you open the browser's Network tab, you will find the following API:
https://a0.p.awsstatic.com/pricing/1.0/ec2/region/us-east-2/reserved-instance/linux/index.json?
It returns the result in JSON format.
import requests

res = requests.get("https://a0.p.awsstatic.com/pricing/1.0/ec2/region/us-east-2/reserved-instance/linux/index.json?", verify=False).json()
for item in res['prices']:
    if 'c5.' in item['attributes']['aws:ec2:instanceType']:
        if (item['attributes']['aws:offerTermLeaseLength'] == "3yr"
                and item['attributes']['aws:offerTermOfferingClass'] == "convertible"
                and item['attributes']['aws:offerTermPurchaseOption'] == "No Upfront"):
            print(item['attributes']['aws:ec2:instanceType'])
            print('$' + str(item['calculatedPrice']['effectiveHourlyRate']['USD']))
Output:
c5.xlarge
$0.08100000000000002
c5.18xlarge
$1.4589999999999999
c5.4xlarge
$0.32400000000000007
c5.2xlarge
$0.16200000000000003
c5.24xlarge
$1.97
c5.9xlarge
$0.729
c5.metal
$1.97
c5.12xlarge
$0.985
c5.large
$0.041
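Values such as $0.08100000000000002 in the output above are ordinary floating-point representation artifacts. The following is a minimal sketch of the same filter with prices formatted to three decimals. It assumes the JSON keeps the structure used in the code above (a prices list whose entries carry attributes and calculatedPrice). The function name convertible_3yr_prices and the sample data are hypothetical, and the sketch matches the prefix with startswith rather than a substring test.

```python
def convertible_3yr_prices(data, prefix="c5."):
    """Return {instance_type: formatted price} for 3-year convertible, No Upfront offers."""
    wanted = {
        "aws:offerTermLeaseLength": "3yr",
        "aws:offerTermOfferingClass": "convertible",
        "aws:offerTermPurchaseOption": "No Upfront",
    }
    result = {}
    for item in data["prices"]:
        attrs = item["attributes"]
        # Keep only instance types that start with the requested prefix.
        if not attrs["aws:ec2:instanceType"].startswith(prefix):
            continue
        # Skip entries that differ from any of the wanted offer attributes.
        if any(attrs.get(key) != value for key, value in wanted.items()):
            continue
        rate = float(item["calculatedPrice"]["effectiveHourlyRate"]["USD"])
        result[attrs["aws:ec2:instanceType"]] = f"${rate:.3f}"
    return result


# Hypothetical sample shaped like the fields accessed in the answer's code.
sample = {"prices": [{
    "attributes": {
        "aws:ec2:instanceType": "c5.xlarge",
        "aws:offerTermLeaseLength": "3yr",
        "aws:offerTermOfferingClass": "convertible",
        "aws:offerTermPurchaseOption": "No Upfront",
    },
    "calculatedPrice": {"effectiveHourlyRate": {"USD": 0.08100000000000002}},
}]}
print(convertible_3yr_prices(sample))
```

In real use, data would be the parsed response from the requests call shown earlier.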
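Comments:
It works, but not all the way to the end :) The price is not correct for every instance :(
What should the price for c5.large be? Can you give an example?
The prices for all instances should be taken from the convertible 3-year term, effective hourly rate with no upfront payment, so for c5.large, for example, it should be $0.041.
If you like, I can post that answer as well; it is easy to get that value with the Python requests module, without Selenium.
If you can :) I would like to learn a web scraping tool other than Selenium. Will that solution also be faster than this one?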