I'm not getting all results. Web scraping with Selenium and Python
Posted: 2022-01-21 18:31:44

Question:
I am new to web scraping with Python and Selenium. My script has a problem: I am not getting all the URLs from the page. There should be 80 URLs, but I only get 20. I would also like to know how to open each URL and extract data from it. I have a solution, but I would like to know how to do the same thing with Selenium.
Imports
import requests
import lxml.html as html
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC, wait
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
Code
def car_url():
    ser = Service("D:/Usuario/Desktop/chrome_driver/chromedriver.exe")
    options = webdriver.ChromeOptions()
    options.add_experimental_option('excludeSwitches', ['enable-logging'])
    driver = webdriver.Chrome(service=ser, options=options)
    driver.get("https://www.contactcars.com/en/cars/new/summary?page=1")
    # Collect the links to the individual car pages
    links = driver.find_elements(By.XPATH,
                                 '//a[@class="car-card__engines__body__list__item__link"]')
    for link in links:
        # Fetch each car page with requests and parse it with lxml
        response = requests.get(link.get_attribute('href'))
        if response.status_code == 200:
            try:
                home = response.content.decode('utf-8')
                parsed = html.fromstring(home)
                name = parsed.xpath('//div[@class="d-inline-block"]/text()')
                model = parsed.xpath(
                    '//div[@class="d-inline-block margin-start--sm"]/text()')
                engine = parsed.xpath(
                    '//div[contains(@class,"engine-title")]/text()')
                print(
                    f"Car name = {name} Model = {model} Engine = {engine}")
            except IndexError as e:
                print(e)
    driver.close()
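Thanks for any help with this.

Comments:
It looks like the page has pagination and lazy loading. You may need to scroll to the bottom first and then run the Selenium code.
Did you print links to see what is actually returned?
You can first check your XPath in the developer tools to see whether it gives the right number of links.
You might want to check the API.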
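Answer 1:
The web page you are looking at updates its elements dynamically as you scroll down.
You therefore have to scroll all the way down to scrape every object from the page.
The code below is a modified version of your code. (It takes 2m 45.5s on my machine.)
Imports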
import time
import requests
import lxml.html as html
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC, wait
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
New function for scrolling down
def scroll_down_all(driver, pause_sec=1):
    # Get the initial scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll down to the bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Pause so that newly loaded elements can appear
        time.sleep(pause_sec)
        # After scrolling, get the current height
        new_height = driver.execute_script("return document.body.scrollHeight")
        # Stop once the height no longer grows
        if new_height == last_height:
            break
        last_height = new_height
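The loop above runs until the page height stops changing, so it never ends on a page that keeps growing. The following is only a sketch of a capped variant, not part of the original answer: `scroll_until_stable`, `get_height`, `scroll_to_bottom` and `max_rounds` are hypothetical names, and the two callables are assumed to wrap the same `driver.execute_script` calls used above.

```python
import time


def scroll_until_stable(get_height, scroll_to_bottom, pause_sec=1.0, max_rounds=50):
    """Scroll until the height stops growing or max_rounds is reached.

    get_height and scroll_to_bottom are plain callables (hypothetical names),
    so the loop itself does not depend on Selenium.
    Returns the number of scrolls performed.
    """
    last_height = get_height()
    for rounds in range(1, max_rounds + 1):
        scroll_to_bottom()
        time.sleep(pause_sec)  # give lazy-loaded content time to arrive
        new_height = get_height()
        if new_height == last_height:
            return rounds
        last_height = new_height
    return max_rounds


# Possible use with a Selenium driver (assumes `driver` already exists):
# scroll_until_stable(
#     lambda: driver.execute_script("return document.body.scrollHeight"),
#     lambda: driver.execute_script("window.scrollTo(0, document.body.scrollHeight);"),
# )
```

The cap trades completeness for a guaranteed exit: if `max_rounds` is reached, some items may still be unloaded, so it is worth comparing the number of links found with the expected count.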
Your function, modified
def car_url():
    ser = Service("D:/Usuario/Desktop/chrome_driver/chromedriver.exe")
    options = webdriver.ChromeOptions()
    options.add_experimental_option('excludeSwitches', ['enable-logging'])
    driver = webdriver.Chrome(service=ser, options=options)
    driver.get("https://www.contactcars.com/en/cars/new/summary?page=1")
    scroll_down_all(driver, pause_sec=1)  # New function for scrolling down
    links = driver.find_elements(By.XPATH,
                                 '//a[@class="car-card__engines__body__list__item__link"]')
    print(len(links))
    for link in links:
        # Fetch each car page with requests and parse it with lxml
        response = requests.get(link.get_attribute('href'))
        if response.status_code == 200:
            try:
                home = response.content.decode('utf-8')
                parsed = html.fromstring(home)
                name = parsed.xpath('//div[@class="d-inline-block"]/text()')
                model = parsed.xpath(
                    '//div[@class="d-inline-block margin-start--sm"]/text()')
                engine = parsed.xpath(
                    '//div[contains(@class,"engine-title")]/text()')
                print(
                    f"Car name = {name} Model = {model} Engine = {engine}")
            except IndexError as e:
                print(e)
    driver.close()
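The question also asks how to open each URL with Selenium instead of requests, which the function above does not cover. Below is a minimal sketch of one way to do that, not the answerer's method: `scrape_details` and its parameters are hypothetical names, and `driver` is assumed to be a WebDriver that is already on the fully scrolled listing page. Selenium can only locate elements, not text nodes, so the field XPaths passed in must not end in `/text()`; the text is read from the `.text` property instead.

```python
def scrape_details(driver, link_xpath, field_xpaths):
    """Visit every link matched by link_xpath and read the given fields.

    driver: assumed to be a Selenium WebDriver on the listing page.
    field_xpaths: dict mapping a field name to an element XPath (no /text()).
    """
    # "xpath" is the value of selenium.webdriver.common.by.By.XPATH
    by_xpath = "xpath"

    # Collect the hrefs first: once the driver navigates away,
    # the elements found on the listing page become stale.
    urls = [a.get_attribute("href")
            for a in driver.find_elements(by_xpath, link_xpath)]

    rows = []
    for url in urls:
        if not url:
            continue  # skip anchors without an href
        driver.get(url)
        row = {"url": url}
        for field, xpath in field_xpaths.items():
            elements = driver.find_elements(by_xpath, xpath)
            row[field] = elements[0].text.strip() if elements else None
        rows.append(row)
    return rows


# Possible use, reusing the XPaths from above without the trailing /text():
# rows = scrape_details(
#     driver,
#     '//a[@class="car-card__engines__body__list__item__link"]',
#     {"name": '//div[@class="d-inline-block"]',
#      "engine": '//div[contains(@class,"engine-title")]'},
# )
```

Loading every page in a browser is likely to be slower than fetching it with requests, so this is mainly useful when the detail pages themselves need JavaScript to render; in that case an explicit wait before reading the fields may also be needed.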
Output of the modified car_url():
Car name = ['Alfa Romeo Giulia '] Model = ['2021'] Engine = ['2.0 A/T Base ']
Car name = ['Alfa Romeo Giulia '] Model = ['2021'] Engine = ['2.0 A/T Super ']
Car name = ['Alfa Romeo Giulia '] Model = ['2021'] Engine = ['2.0 A/T Veloce ']
Car name = ['Alfa Romeo Giulietta '] Model = ['2021'] Engine = ['1.4 A/T premium ']
Car name = ['Alfa Romeo Giulietta veloce '] Model = ['2021'] Engine = ['1.8 A/T F/O ']
Car name = ['Alfa Romeo Stelvio '] Model = ['2021'] Engine = ['2.0 A/T Turbo Super ']
Car name = ['Alfa Romeo Stelvio '] Model = ['2021'] Engine = ['2.0 A/T Turbo Super plus ']
Car name = ['Aston Martin Vantage '] Model = ['2020'] Engine = ['4.0 A/T F/O Coupe ']
Car name = ['Audi A3 '] Model = ['2022'] Engine = ['1.4 A/T Advanced ']
Car name = ['Audi A3 '] Model = ['2022'] Engine = ['1.4 A/T S-Line ']
Car name = ['Audi A4 '] Model = ['2022'] Engine = ['2.0 A/T S-Line ']
Car name = ['Audi A4 '] Model = ['2022'] Engine = ['2.0 A/T S-Line plus ']
Car name = ['Audi A5 '] Model = ['2022'] Engine = ['2.0 A/T S-Line ']
Car name = ['Audi A6 '] Model = ['2022'] Engine = ['2.0 A/T S-Line ']
Car name = ['Audi Q2 '] Model = ['2022'] Engine = ['1.4 A/T Advanced ']
Car name = ['Audi Q2 '] Model = ['2022'] Engine = ['1.4 A/T S line ']
Car name = ['Audi Q3 '] Model = ['2022'] Engine = ['1.4 A/T Advanced ']
Car name = ['Audi Q3 '] Model = ['2022'] Engine = ['1.4 A/T S-Line New Rim ']
Car name = ['Audi Q3 '] Model = ['2022'] Engine = ['1.4 A/T Sport back with New Rim ']
Car name = ['Audi Q3 '] Model = ['2022'] Engine = ['2.0 A/T Sport back ']
Car name = ['Audi Q7 '] Model = ['2022'] Engine = ['2.0 A/T TFSI Premium ']
Car name = ['Audi Q8 '] Model = ['2022'] Engine = ['3.0 A/T S-Line ']
Car name = ['Audi Q8 '] Model = ['2022'] Engine = ['3.0 A/T S-Line Plus ']
Car name = ['Audi RS Q8 '] Model = ['2022'] Engine = ['4.0 A/T TFSI Quattro ']
Car name = ['BAIC X3 '] Model = ['2022'] Engine = ['1.5 A/T Comfort ']
Car name = ['BAIC X3 '] Model = ['2022'] Engine = ['1.5 A/T Comfort Plus ']
Car name = ['BAIC X3 '] Model = ['2022'] Engine = ['1.5 A/T Luxury ']
Car name = ['BAIC X3 '] Model = ['2022'] Engine = ['1.5 A/T Luxury Plus ']
Car name = ['BAIC X5 '] Model = ['2022'] Engine = ['1.5 A/T Turbo ']
Car name = ['BAIC X7 '] Model = ['2022'] Engine = ['1.5 A/T Turbo Elite ']
Car name = ['BAIC X7 '] Model = ['2022'] Engine = ['1.5 A/T Turbo Premium ']
Car name = ['BMW 118i '] Model = ['2021'] Engine = ['1.5 A/T Sport line ']
Car name = ['BMW 218 '] Model = ['2020'] Engine = ['1.5 A/T F/O Exclusive 7 Seats ']
Car name = ['BMW 218i '] Model = ['2021'] Engine = ['1.5 A/T Gran Coupe ']
Car name = ['BMW 320i '] Model = ['2021'] Engine = ['2.0 A/T Exclusive ']
Car name = ['BMW 520i '] Model = ['2021'] Engine = ['1.6 A/T Luxury ']
Car name = ['BMW 730Li '] Model = ['2021'] Engine = ['2.0 A/T F/O ']
Car name = ['BMW 740Li '] Model = ['2020'] Engine = ['3.0 A/T Pure Excellence ']
Car name = ['BMW 750i '] Model = ['2020'] Engine = ['4.4 F/O A/T ']
Car name = ['BMW M850i '] Model = ['2021'] Engine = ['4.4 F/O A/T Gran Coupe ']
Car name = ['BMW X1 '] Model = ['2021'] Engine = ['1.5 A/T X-Line ']
Car name = ['BMW X1 '] Model = ['2021'] Engine = ['1.5 A/T Sport ']
Car name = ['BMW X2 '] Model = ['2020'] Engine = ['1.5 A/T Exclusive ']
Car name = ['BMW X2 '] Model = ['2020'] Engine = ['1.5 A/T M-Sport ']
Car name = ['BMW X3 '] Model = ['2021'] Engine = ['2.0 A/T 30i New Shape ']
Car name = ['BMW X3 '] Model = ['2021'] Engine = ['3.0 A/T M40i New Shape ']
Car name = ['BMW X4 '] Model = ['2021'] Engine = ['2.0 A/T 30i M ']
Car name = ['BMW X5 '] Model = ['2021'] Engine = ['3.0 A/T 40i ']
Car name = ['BMW X5 '] Model = ['2021'] Engine = ['4.4 F/O A/T M50i ']
Car name = ['BMW X6 '] Model = ['2021'] Engine = ['4.4 F/O A/T M50i-Sport ']
Car name = ['BMW X7 '] Model = ['2021'] Engine = ['4.4 A/T M50i ']
Car name = ['BMW Z4 '] Model = ['2020'] Engine = ['2.0 A/T M-Sport Roadster ']
Car name = ['BMW i8 '] Model = ['2020'] Engine = ['1.5 A/T Coupe LCI ']
Car name = ['BMW i8 '] Model = ['2020'] Engine = ['1.5 A/T Roadster ']
Car name = ['BYD F3 '] Model = ['2022'] Engine = ['1.5 M/T ']
Car name = ['BYD F3 '] Model = ['2022'] Engine = ['1.5 A/T ']
Car name = ['BYD L3 '] Model = ['2022'] Engine = ['1.5 A/T GLX ']
Car name = ['BYD L3 '] Model = ['2022'] Engine = ['1.5 A/T GS ']
Car name = ['Bestune B70 '] Model = ['2022'] Engine = ['1.5 A/T Turbo Standard ']
Car name = ['Bestune B70 '] Model = ['2022'] Engine = ['1.5 A/T Turbo Comfort ']
Car name = ['Bestune B70 '] Model = ['2022'] Engine = ['1.5 A/T Turbo Deluxe ']
Car name = ['Bestune T33 '] Model = ['2022'] Engine = ['1.6 A/T Luxury ']
Car name = ['Bestune T33 '] Model = ['2022'] Engine = ['1.6 A/T Premium ']
Car name = ['Bestune T55 '] Model = ['2022'] Engine = ['1.5 A/T Turbo Deluxe ']
Car name = ['Bestune T55 '] Model = ['2022'] Engine = ['1.5 A/T Turbo Premium ']
Car name = ['Bestune T55 '] Model = ['2022'] Engine = ['1.5 A/T Turbo Premium Plus ']
Car name = ['Bestune T77 '] Model = ['2022'] Engine = ['1.5 A/T Turbo Deluxe ']
Car name = ['Bestune T77 '] Model = ['2022'] Engine = ['1.5 A/T Turbo Premium ']
Car name = ['Bestune T77 '] Model = ['2022'] Engine = ['1.5 A/T Turbo Premium Plus ']
Car name = ['Brilliance V6 '] Model = ['2021'] Engine = ['1.5 A/T Deluxe ']
Car name = ['Brilliance V6 '] Model = ['2021'] Engine = ['1.5 A/T Grand Deluxe ']
Car name = ['Brilliance V6 '] Model = ['2021'] Engine = ['1.5 A/T Flagship ']
Car name = ['Changan Alsvin '] Model = ['2022'] Engine = ['1.4 M/T L1 ']
Car name = ['Changan Alsvin '] Model = ['2022'] Engine = ['A/T L1 1.5 ']
Car name = ['Changan Alsvin '] Model = ['2022'] Engine = ['A/T L2 1.5 ']
Car name = ['Changan CS15 '] Model = ['2022'] Engine = ['A/T L1 1.5 ']
Car name = ['Changan CS15 '] Model = ['2022'] Engine = ['A/T L2 1.5 ']
Car name = ['Changan CS55 '] Model = ['2022'] Engine = ['A/T Turbo L1 1.5 ']
Car name = ['Changan CS55 '] Model = ['2022'] Engine = ['A/T Turbo L2 1.5 ']
Car name = ['Changan CS55 '] Model = ['2022'] Engine = ['A/T Turbo L3 1.5 ']
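Each value in the output is printed as a list with trailing spaces because lxml's `xpath()` returns a list of matching text nodes. If plain strings are preferred, a small helper can pick the first non-empty match and strip it. This is only a sketch, and `first_text` is a hypothetical name that is not in the original answer.

```python
def first_text(values):
    """Return the first non-empty string in values, stripped, or None."""
    for value in values:
        text = value.strip()
        if text:
            return text
    return None


# Sample values copied from the first output line above
name = first_text(['Alfa Romeo Giulia '])
model = first_text(['2021'])
engine = first_text(['2.0 A/T Base '])
print(f"Car name = {name} | Model = {model} | Engine = {engine}")
# prints: Car name = Alfa Romeo Giulia | Model = 2021 | Engine = 2.0 A/T Base
```

Returning None for an empty result keeps a missing field visible instead of raising an IndexError, as indexing with `[0]` would.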
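Comments:
Thanks for your help. Do you know where I can read about JavaScript code that is useful for web scraping?
Sorry, I am not familiar with JavaScript. In my experience, understanding the web environment (HTML, client-server, and so on) is more helpful for building a web scraping bot.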