I'm not getting all results. Web scraping with Selenium and Python


【Title】I'm not getting all results. Web scraping with Selenium and Python 【Posted】2022-01-21 18:31:44 【Question】:

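I'm new to web scraping with Python and Selenium. My script has a problem: I'm not getting all the URLs from the page. There should be 80 URLs, but I only get 20. I'd also like to know how to open each URL and extract data from it. I have a solution, but I'd like to know how to do the same thing with Selenium.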

Imports

import requests
import lxml.html as html
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC, wait
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

Code

def car_url():
    ser = Service("D:/Usuario/Desktop/chrome_driver/chromedriver.exe")
    options = webdriver.ChromeOptions()
    options.add_experimental_option('excludeSwitches', ['enable-logging'])
    driver = webdriver.Chrome(service=ser, options=options)
    driver.get("https://www.contactcars.com/en/cars/new/summary?page=1")

    links = driver.find_elements(By.XPATH,
                                 '//a[@class="car-card__engines__body__list__item__link"]')

    for link in links:
        response = requests.get(link.get_attribute('href'))
        if response.status_code == 200:
            try:
                home = response.content.decode('utf-8')
                parsed = html.fromstring(home)
                name = parsed.xpath('//div[@class="d-inline-block"]/text()')
                model = parsed.xpath(
                    '//div[@class="d-inline-block margin-start--sm"]/text()')
                engine = parsed.xpath(
                    '//div[contains(@class,"engine-title")]/text()')
                print(
                    f"Car name = {name} Model = {model} Engine = {engine}")
            except IndexError as e:
                print(e)

    driver.close()

Thanks for any help with this.

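【Comments】:

It looks like the page has pagination and lazy loading; you may need to scroll to the bottom before running the Selenium code.

Did you print links to see what is actually returned?

You can first check your XPath in the developer tools to see whether it matches the expected number of links.

You may want to check the API.

【Answer 1】: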

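The page you are looking at updates its elements dynamically as you scroll down.

So you have to scroll all the way down to scrape all the objects from that page.

The code below is a modified version of your code. (It takes 2m 45.5s on my machine.)

Imports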

import time
import requests
import lxml.html as html
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC, wait
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

New function for scrolling down

def scroll_down_all(driver, pause_sec=1):
    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Pause
        time.sleep(pause_sec)

        # After scroll down, get current height.
        new_height = driver.execute_script("return document.body.scrollHeight")

        # Stop once the height no longer changes
        if new_height == last_height:
            break

        last_height = new_height
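
One caveat with a scroll loop like the one above: it exits only when the page height stops changing, so a page that keeps loading content forever would never let it return. The following is a minimal sketch of a bounded variant, not part of the original answer. The function name `scroll_down_bounded` and the `max_rounds` parameter (with its default of 50) are hypothetical; the two JavaScript snippets are the same ones used above.

```python
import time


def scroll_down_bounded(driver, pause_sec=1.0, max_rounds=50):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed. `max_rounds` is a safety cap
    (hypothetical default) so an endless feed cannot hang the script.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        # Jump to the current bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give lazy-loaded content time to arrive
        time.sleep(pause_sec)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was loaded by the last scroll
        last_height = new_height
    return max_rounds  # cap reached; the page may still have more content
```

It can be called in place of `scroll_down_all(driver, pause_sec=1)`. If the return value equals `max_rounds`, the page may not have been fully loaded and the cap probably needs raising.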

Your function, modified

def car_url():
    
    ser = Service("D:/Usuario/Desktop/chrome_driver/chromedriver.exe")
    options = webdriver.ChromeOptions()
    options.add_experimental_option('excludeSwitches', ['enable-logging'])
    driver = webdriver.Chrome(service=ser, options=options)
    driver.get("https://www.contactcars.com/en/cars/new/summary?page=1")

    scroll_down_all(driver, pause_sec=1) # New function for scroll down
    
    links = driver.find_elements(By.XPATH,
                                 '//a[@class="car-card__engines__body__list__item__link"]')

    print(len(links))
    for link in links:
        response = requests.get(link.get_attribute('href'))
        if response.status_code == 200:
            try:
                home = response.content.decode('utf-8')
                parsed = html.fromstring(home)
                name = parsed.xpath('//div[@class="d-inline-block"]/text()')
                model = parsed.xpath(
                    '//div[@class="d-inline-block margin-start--sm"]/text()')
                engine = parsed.xpath(
                    '//div[contains(@class,"engine-title")]/text()')
                print(
                    f"Car name = {name} Model = {model} Engine = {engine}")
            except IndexError as e:
                print(e)

    driver.close()

Result of this code

Car name = ['Alfa Romeo Giulia '] Model = ['2021'] Engine = ['2.0 A/T Base ']
Car name = ['Alfa Romeo Giulia '] Model = ['2021'] Engine = ['2.0 A/T Super ']
Car name = ['Alfa Romeo Giulia '] Model = ['2021'] Engine = ['2.0 A/T Veloce ']
Car name = ['Alfa Romeo Giulietta '] Model = ['2021'] Engine = ['1.4 A/T premium ']
Car name = ['Alfa Romeo Giulietta veloce '] Model = ['2021'] Engine = ['1.8 A/T F/O ']
Car name = ['Alfa Romeo Stelvio '] Model = ['2021'] Engine = ['2.0 A/T Turbo Super ']
Car name = ['Alfa Romeo Stelvio '] Model = ['2021'] Engine = ['2.0 A/T Turbo Super plus ']
Car name = ['Aston Martin Vantage '] Model = ['2020'] Engine = ['4.0 A/T F/O Coupe ']
Car name = ['Audi A3 '] Model = ['2022'] Engine = ['1.4 A/T Advanced ']
Car name = ['Audi A3 '] Model = ['2022'] Engine = ['1.4 A/T S-Line ']
Car name = ['Audi A4 '] Model = ['2022'] Engine = ['2.0 A/T S-Line ']
Car name = ['Audi A4 '] Model = ['2022'] Engine = ['2.0 A/T S-Line plus ']
Car name = ['Audi A5 '] Model = ['2022'] Engine = ['2.0 A/T S-Line ']
Car name = ['Audi A6 '] Model = ['2022'] Engine = ['2.0 A/T S-Line ']
Car name = ['Audi Q2 '] Model = ['2022'] Engine = ['1.4 A/T Advanced ']
Car name = ['Audi Q2 '] Model = ['2022'] Engine = ['1.4 A/T S line ']
Car name = ['Audi Q3 '] Model = ['2022'] Engine = ['1.4 A/T Advanced ']
Car name = ['Audi Q3 '] Model = ['2022'] Engine = ['1.4 A/T S-Line New Rim ']
Car name = ['Audi Q3 '] Model = ['2022'] Engine = ['1.4 A/T Sport back with New Rim ']
Car name = ['Audi Q3 '] Model = ['2022'] Engine = ['2.0 A/T Sport back ']
Car name = ['Audi Q7 '] Model = ['2022'] Engine = ['2.0 A/T TFSI Premium ']
Car name = ['Audi Q8 '] Model = ['2022'] Engine = ['3.0 A/T S-Line ']
Car name = ['Audi Q8 '] Model = ['2022'] Engine = ['3.0 A/T S-Line Plus ']
Car name = ['Audi RS Q8 '] Model = ['2022'] Engine = ['4.0 A/T TFSI Quattro ']
Car name = ['BAIC X3 '] Model = ['2022'] Engine = ['1.5 A/T Comfort ']
Car name = ['BAIC X3 '] Model = ['2022'] Engine = ['1.5 A/T Comfort Plus ']
Car name = ['BAIC X3 '] Model = ['2022'] Engine = ['1.5 A/T Luxury ']
Car name = ['BAIC X3 '] Model = ['2022'] Engine = ['1.5 A/T Luxury Plus ']
Car name = ['BAIC X5 '] Model = ['2022'] Engine = ['1.5 A/T Turbo ']
Car name = ['BAIC X7 '] Model = ['2022'] Engine = ['1.5 A/T Turbo Elite ']
Car name = ['BAIC X7 '] Model = ['2022'] Engine = ['1.5 A/T Turbo Premium ']
Car name = ['BMW 118i '] Model = ['2021'] Engine = ['1.5 A/T Sport line ']
Car name = ['BMW 218 '] Model = ['2020'] Engine = ['1.5 A/T F/O Exclusive 7 Seats ']
Car name = ['BMW 218i '] Model = ['2021'] Engine = ['1.5 A/T Gran Coupe ']
Car name = ['BMW 320i '] Model = ['2021'] Engine = ['2.0 A/T Exclusive ']
Car name = ['BMW 520i '] Model = ['2021'] Engine = ['1.6 A/T Luxury ']
Car name = ['BMW 730Li '] Model = ['2021'] Engine = ['2.0 A/T F/O ']
Car name = ['BMW 740Li '] Model = ['2020'] Engine = ['3.0 A/T Pure Excellence ']
Car name = ['BMW 750i '] Model = ['2020'] Engine = ['4.4 F/O A/T ']
Car name = ['BMW M850i '] Model = ['2021'] Engine = ['4.4 F/O A/T Gran Coupe ']
Car name = ['BMW X1 '] Model = ['2021'] Engine = ['1.5 A/T X-Line ']
Car name = ['BMW X1 '] Model = ['2021'] Engine = ['1.5 A/T Sport ']
Car name = ['BMW X2 '] Model = ['2020'] Engine = ['1.5 A/T Exclusive ']
Car name = ['BMW X2 '] Model = ['2020'] Engine = ['1.5 A/T M-Sport ']
Car name = ['BMW X3 '] Model = ['2021'] Engine = ['2.0 A/T 30i New Shape ']
Car name = ['BMW X3 '] Model = ['2021'] Engine = ['3.0 A/T M40i New Shape ']
Car name = ['BMW X4 '] Model = ['2021'] Engine = ['2.0 A/T 30i M ']
Car name = ['BMW X5 '] Model = ['2021'] Engine = ['3.0 A/T 40i ']
Car name = ['BMW X5 '] Model = ['2021'] Engine = ['4.4 F/O A/T M50i ']
Car name = ['BMW X6 '] Model = ['2021'] Engine = ['4.4 F/O A/T M50i-Sport ']
Car name = ['BMW X7 '] Model = ['2021'] Engine = ['4.4 A/T M50i ']
Car name = ['BMW Z4 '] Model = ['2020'] Engine = ['2.0 A/T M-Sport Roadster ']
Car name = ['BMW i8 '] Model = ['2020'] Engine = ['1.5 A/T Coupe LCI ']
Car name = ['BMW i8 '] Model = ['2020'] Engine = ['1.5 A/T Roadster ']
Car name = ['BYD F3 '] Model = ['2022'] Engine = ['1.5 M/T ']
Car name = ['BYD F3 '] Model = ['2022'] Engine = ['1.5 A/T ']
Car name = ['BYD L3 '] Model = ['2022'] Engine = ['1.5 A/T GLX ']
Car name = ['BYD L3 '] Model = ['2022'] Engine = ['1.5 A/T GS ']
Car name = ['Bestune B70 '] Model = ['2022'] Engine = ['1.5 A/T Turbo Standard ']
Car name = ['Bestune B70 '] Model = ['2022'] Engine = ['1.5 A/T Turbo Comfort ']
Car name = ['Bestune B70 '] Model = ['2022'] Engine = ['1.5 A/T Turbo Deluxe ']
Car name = ['Bestune T33 '] Model = ['2022'] Engine = ['1.6 A/T Luxury ']
Car name = ['Bestune T33 '] Model = ['2022'] Engine = ['1.6 A/T Premium ']
Car name = ['Bestune T55 '] Model = ['2022'] Engine = ['1.5 A/T Turbo Deluxe ']
Car name = ['Bestune T55 '] Model = ['2022'] Engine = ['1.5 A/T Turbo Premium ']
Car name = ['Bestune T55 '] Model = ['2022'] Engine = ['1.5 A/T Turbo Premium Plus ']
Car name = ['Bestune T77 '] Model = ['2022'] Engine = ['1.5 A/T Turbo Deluxe ']
Car name = ['Bestune T77 '] Model = ['2022'] Engine = ['1.5 A/T Turbo Premium ']
Car name = ['Bestune T77 '] Model = ['2022'] Engine = ['1.5 A/T Turbo Premium Plus ']
Car name = ['Brilliance V6 '] Model = ['2021'] Engine = ['1.5 A/T Deluxe ']
Car name = ['Brilliance V6 '] Model = ['2021'] Engine = ['1.5 A/T Grand Deluxe ']
Car name = ['Brilliance V6 '] Model = ['2021'] Engine = ['1.5 A/T Flagship ']
Car name = ['Changan Alsvin '] Model = ['2022'] Engine = ['1.4 M/T L1 ']
Car name = ['Changan Alsvin '] Model = ['2022'] Engine = ['A/T L1 1.5 ']
Car name = ['Changan Alsvin '] Model = ['2022'] Engine = ['A/T L2 1.5 ']
Car name = ['Changan CS15 '] Model = ['2022'] Engine = ['A/T L1 1.5 ']
Car name = ['Changan CS15 '] Model = ['2022'] Engine = ['A/T L2 1.5 ']
Car name = ['Changan CS55 '] Model = ['2022'] Engine = ['A/T Turbo L1 1.5 ']
Car name = ['Changan CS55 '] Model = ['2022'] Engine = ['A/T Turbo L2 1.5 ']
Car name = ['Changan CS55 '] Model = ['2022'] Engine = ['A/T Turbo L3 1.5 ']
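
The question also asked how to open each URL and extract the data with Selenium itself rather than with requests and lxml. The following is only a sketch under stated assumptions, not the answerer's method. It reuses the three XPath expressions from the code above without the trailing `/text()`, because Selenium locates elements rather than text nodes and the text is read from the element's `.text` attribute instead. The names `FIELDS`, `first_text`, and `scrape_details` are hypothetical. The string `"xpath"` stands in for `By.XPATH` (they are the same value) so that the sketch needs no imports.

```python
XPATH = "xpath"  # same string value as selenium's By.XPATH

# Selectors from the code above, minus /text(): Selenium returns elements,
# and the visible text is read from each element's .text attribute.
FIELDS = {
    "name": '//div[@class="d-inline-block"]',
    "model": '//div[@class="d-inline-block margin-start--sm"]',
    "engine": '//div[contains(@class,"engine-title")]',
}


def first_text(elements):
    """Return the stripped text of the first element, or None if none matched."""
    return elements[0].text.strip() if elements else None


def scrape_details(driver, urls):
    """Visit each URL with the same driver and collect one dict per page."""
    rows = []
    for url in urls:
        driver.get(url)  # navigate the existing browser session
        rows.append({
            key: first_text(driver.find_elements(XPATH, xpath))
            for key, xpath in FIELDS.items()
        })
    return rows
```

The URLs should be copied into a plain list before navigating, for example `urls = [link.get_attribute('href') for link in links]`, because the `links` elements become stale once `driver.get` loads a different page. If the detail pages build their content with JavaScript, an explicit wait may be needed before reading the elements; that is not shown here.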

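【Comments】:

Thanks for your help. Do you know where I can read about JavaScript code that is useful for web scraping?

Sorry, I'm not familiar with JavaScript. In my experience, understanding the web environment (HTML, client-server, and so on) is more helpful for building a web scraping bot.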
