Script Suddenly Stops Crawling Without Error or Exception

Posted: 2019-03-12 12:37:00

I have no idea why, but my script always stops crawling once it reaches page 9. There are no errors, exceptions, or warnings, so I'm at a bit of a loss.

Can somebody help me out?

P.S. Here is the full script in case anybody wants to test it for themselves!

def initiate_crawl():
    def refresh_page(url):
        ff = create_webdriver_instance()
        ff.get(url)
        ff.find_element(By.XPATH, '//*[@id="FilterItemView_sortOrder_dropdown"]/div/span[2]/span/span/span/span').click()
        ff.find_element(By.XPATH, '//a[contains(text(), "Discount - High to Low")]').click()
        items = WebDriverWait(ff, 15).until(
            EC.visibility_of_all_elements_located((By.XPATH, '//div[contains(@id, "100_dealView_")]'))
        )
        print(len(items))
        for count, item in enumerate(items):
            slashed_price = item.find_elements(By.XPATH, './/span[contains(@class, "a-text-strike")]')
            active_deals = item.find_elements(By.XPATH, './/*[contains(text(), "Add to Cart")]')
            if len(slashed_price) > 0 and len(active_deals) > 0:
                product_title = item.find_element(By.ID, 'dealTitle').text
                if product_title not in already_scraped_product_titles:
                    already_scraped_product_titles.append(product_title)
                    url = ff.current_url
                    ff.quit()
                    refresh_page(url)
                    break
            if count+1 is len(items):
                try:
                    next_button = WebDriverWait(ff, 15).until(
                        EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→')
                    )
                    ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()
                    url = ff.current_url
                    ff.quit()
                    refresh_page(url)
                except Exception as error:
                    print(error)
                    ff.quit()

    refresh_page('https://www.amazon.ca/gp/goldbox/ref=gbps_ftr_s-3_4bc8_dct_10-?gb_f_c2xvdC0z=sortOrder:BY_SCORE,discountRanges:10-25%252C25-50%252C50-70%252C70-&pf_rd_p=f5836aee-0969-4c39-9720-4f0cacf64bc8&pf_rd_s=slot-3&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A3DWYIK6Y9EEQB&pf_rd_r=CQ7KBNXT36G95190QJB1&ie=UTF8')

initiate_crawl()

Printing the length of items also turned up some strange behaviour. Instead of always returning 32, which would correspond to the number of items on each page, it printed 32 on the first page, 64 on the second, 96 on the third, and so on. I fixed this by using //div[contains(@id, "100_dealView_")]/div[contains(@class, "dealContainer")] instead of //div[contains(@id, "100_dealView_")] as the XPath for the items variable. I'm hoping this is the reason it was running into problems on page 9. I'm running tests right now. Update: It is now scraping page 10 and beyond, so the issue is resolved.
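For reference, a minimal sketch of the locator change described above; the count_items helper name is mine, not part of the original script:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Old locator: matched the outer "100_dealView_" wrappers, whose reported
# count kept growing across pages (32, 64, 96, ...).
OLD_ITEMS_XPATH = '//div[contains(@id, "100_dealView_")]'

# Fixed locator: scoped to the individual deal containers on the current page.
NEW_ITEMS_XPATH = ('//div[contains(@id, "100_dealView_")]'
                   '/div[contains(@class, "dealContainer")]')

def count_items(ff):
    # 'ff' is assumed to be a live WebDriver already on a deals page
    items = WebDriverWait(ff, 15).until(
        EC.visibility_of_all_elements_located((By.XPATH, NEW_ITEMS_XPATH))
    )
    print(len(items))  # should now print 32 on every page
    return items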

Comments:

Did you monitor the crawl? Is there still some kind of "More" button on page 9?

@jihan1008 Everything is being monitored. I've checked the xpath, everything; nothing seems to be off.

Could you check with different browser versions?

I couldn't get your script to run, but it looks like at some point you're getting items of length 0, so the enumerate loop never happens. Try printing the length of items before the loop and see what happens right before the code ends.

@AndrewMcDowell Good idea! I believe it must be somewhere else in the script. I currently have a bunch of time.sleep(n)s set up and am testing with those. I'll print the lengths after that! Thanks for the input.

Answer 1:

As per the 10th revision of this question, the error message...

HTTPConnectionPool(host='127.0.0.1', port=58992): Max retries exceeded with url: /session/e8beed9b-4faa-4e91-a659-56761cb604d7/element (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000022D31378A58>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))

...implies that the get() method failed, raising an HTTPConnectionPool error with the message Max retries exceeded.

A couple of points:

As per the discussion in max-retries-exceeded exceptions are confusing, the traceback is somewhat misleading. Requests wraps the exception for the user's convenience; the original exception is part of the message displayed.

Requests never retries (it sets retries=0 on urllib3's HTTPConnectionPool), so the error would be far more canonical without the MaxRetryError and HTTPConnectionPool keywords. An ideal traceback would therefore be:

  NewConnectionError(<class 'socket.error'>: [Errno 10061] No connection could be made because the target machine actively refused it)

You will find a detailed explanation in MaxRetryError: HTTPConnectionPool: Max retries exceeded (Caused by ProtocolError('Connection aborted.', error(111, 'Connection refused')))
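To see that wrapping in action, here is a minimal sketch of my own, assuming nothing is listening on 127.0.0.1:1 so the connection is refused:

  import urllib3

  http = urllib3.PoolManager()

  # Default behaviour: the refused connection surfaces as a NewConnectionError
  # wrapped inside a MaxRetryError by the connection pool.
  try:
      http.request('GET', 'http://127.0.0.1:1/')
  except urllib3.exceptions.MaxRetryError as err:
      print('Wrapped:', err.reason)

  # With retries disabled (as Requests effectively does), the raw underlying
  # error is raised immediately instead.
  try:
      http.request('GET', 'http://127.0.0.1:1/', retries=False)
  except urllib3.exceptions.NewConnectionError as err:
      print('Raw:', err)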

Solution

As per the Selenium 3.14.1 release notes:

* Fix ability to set timeout for urllib3 (#6286)

The merge is: repair urllib3 can't set timeout!

Conclusion

Once you upgrade to Selenium 3.14.1, you will be able to set the timeout, see canonical tracebacks, and take the required action.
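As an illustration (my sketch, not part of the answer), the Selenium 3 Python bindings expose RemoteConnection.set_timeout() for exactly this, so a hung WebDriver command fails with a canonical traceback instead of stalling silently:

  from selenium.webdriver.remote.remote_connection import RemoteConnection

  # apply a 30-second timeout to every WebDriver HTTP command
  RemoteConnection.set_timeout(30)
  print(RemoteConnection.get_timeout())  # 30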

References

A couple of relevant references:

Adding max_retries as an argument

Removed the bundled charade and urllib3.

Third party libraries committed verbatim

This use case

I took your full script from codepen.io - A PEN BY Anthony. I had to make a few tweaks to your existing code, as follows:

As you have used:

  ua_string = random.choice(ua_strings)

You have to import random as:

    import random

You have created the variable next_button but haven't used it anywhere. I have condensed the following four lines:

  next_button = WebDriverWait(ff, 15).until(
                  EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→')
              )
  ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()

As:

  WebDriverWait(ff, 15).until(EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→'))
  ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()              

Your modified code block will be:

  # -*- coding: utf-8 -*-
  from selenium import webdriver
  from selenium.webdriver.firefox.options import Options
  from selenium.webdriver.common.by import By
  from selenium.webdriver.support import expected_conditions as EC
  from selenium.webdriver.support.ui import WebDriverWait
  import time
  import random


  """ Set Global Variables
  """
  ua_strings = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36']
  already_scraped_product_titles = []



  """ Create Instances of WebDriver
  """
  def create_webdriver_instance():
      ua_string = random.choice(ua_strings)
      profile = webdriver.FirefoxProfile()
      profile.set_preference('general.useragent.override', ua_string)
      options = Options()
      options.add_argument('--headless')
      # pass options so the '--headless' argument actually takes effect
      return webdriver.Firefox(firefox_profile=profile, options=options)



  """ Construct List of UA Strings
  """
  def fetch_ua_strings():
      ff = create_webdriver_instance()
      ff.get('https://techblog.willshouse.com/2012/01/03/most-common-user-agents/')
      ua_strings_ff_eles = ff.find_elements_by_xpath('//td[@class="useragent"]')
      for ua_string in ua_strings_ff_eles:
          if 'mobile' not in ua_string.text and 'Trident' not in ua_string.text:
              ua_strings.append(ua_string.text)
      ff.quit()



  """ Log in to Amazon to Use SiteStripe in order to Generate Affiliate Links
  """
  def log_in(ff):
      ff.find_element(By.XPATH, '//a[@id="nav-link-yourAccount"] | //a[@id="nav-link-accountList"]').click()
      ff.find_element(By.ID, 'ap_email').send_keys('anthony_falez@hotmail.com')
      ff.find_element(By.ID, 'continue').click()
      ff.find_element(By.ID, 'ap_password').send_keys('lo0kyLoOkYig0t4h')
      ff.find_element(By.NAME, 'rememberMe').click()
      ff.find_element(By.ID, 'signInSubmit').click()



  """ Build Lists of Product Page URLs
  """
  def initiate_crawl():
      def refresh_page(url):
          ff = create_webdriver_instance()
          ff.get(url)
          ff.find_element(By.XPATH, '//*[@id="FilterItemView_sortOrder_dropdown"]/div/span[2]/span/span/span/span').click()
          ff.find_element(By.XPATH, '//a[contains(text(), "Discount - High to Low")]').click()
          items = WebDriverWait(ff, 15).until(
              EC.visibility_of_all_elements_located((By.XPATH, '//div[contains(@id, "100_dealView_")]'))
          )
          for count, item in enumerate(items):
              slashed_price = item.find_elements(By.XPATH, './/span[contains(@class, "a-text-strike")]')
              active_deals = item.find_elements(By.XPATH, './/*[contains(text(), "Add to Cart")]')
              # For Groups of Items on Sale
              # active_deals = //*[contains(text(), "Add to Cart") or contains(text(), "View Deal")]
              if len(slashed_price) > 0 and len(active_deals) > 0:
                  product_title = item.find_element(By.ID, 'dealTitle').text
                  if product_title not in already_scraped_product_titles:
                      already_scraped_product_titles.append(product_title)
                      url = ff.current_url
                      # Scrape Details of Each Deal
                      #extract(ff, item.find_element(By.ID, 'dealImage').get_attribute('href'))
                      print(product_title[:10])
                      ff.quit()
                      refresh_page(url)
                      break
              if count+1 is len(items):
                  try:
                      print('')
                      print('new page')
                      WebDriverWait(ff, 15).until(EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→'))
                      ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()
                      time.sleep(10)
                      url = ff.current_url
                      print(url)
                      print('')
                      ff.quit()
                      refresh_page(url)
                  except Exception as error:
                      """
                      ff.find_element(By.XPATH, '//*[@id="pagination-both-004143081429407891"]/ul/li[9]/a').click()
                      url = ff.current_url
                      ff.quit()
                      refresh_page(url)
                      """
                      print('cannot find ff.find_element(By.PARTIAL_LINK_TEXT, "Next→")')
                      print('Because of... {}'.format(error))
                      ff.quit()

      refresh_page('https://www.amazon.ca/gp/goldbox/ref=gbps_ftr_s-3_4bc8_dct_10-?gb_f_c2xvdC0z=sortOrder:BY_SCORE,discountRanges:10-25%252C25-50%252C50-70%252C70-&pf_rd_p=f5836aee-0969-4c39-9720-4f0cacf64bc8&pf_rd_s=slot-3&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A3DWYIK6Y9EEQB&pf_rd_r=CQ7KBNXT36G95190QJB1&ie=UTF8')

  #def extract_info(ff, url):
  fetch_ua_strings()
  initiate_crawl()

Console output: With Selenium v3.14.0 and Firefox Quantum v62.0.3, I was able to extract the following output on the console:

  J.Rosée Si
  B.Catcher 
  Bluetooth4
  FRAM G4164
  Major Crim
  20% off Oh
  True Blood
  Prime-Line
  Marathon 3
  True Blood
  B.Catcher 
  4 Film Fav
  True Blood
  Texture Pa
  Westinghou
  True Blood
  ThermoPro 
  ...
  ...
  ...

Note: I could have optimized your code to perform the same web scraping operations by initializing the Firefox browser client only once and traversing the various products and their details, but to preserve your logic and innovation I have suggested only the minimal changes needed to get you through (a rough sketch of the single-instance variant follows).
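For what it's worth, that single-instance approach might look like the following; crawl_single_instance is my illustrative name, and the create_webdriver_instance() helper and the imports from the script above are assumed:

  def crawl_single_instance(start_url):
      ff = create_webdriver_instance()
      ff.get(start_url)
      while True:
          items = WebDriverWait(ff, 15).until(
              EC.visibility_of_all_elements_located((By.XPATH, '//div[contains(@id, "100_dealView_")]'))
          )
          for item in items:
              pass  # scrape each deal here, reusing the same driver
          try:
              # page forward instead of quitting and recreating the browser
              ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()
          except Exception:
              break  # no Next link left, so this was the last page
      ff.quit()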

Discussion:

The HTTPConnectionPool error in my question was an outlier. The script ran fine until it suddenly stopped on page 9, without error or exception. The only reason I set next_button without using it is that I was trying to troubleshoot this, thought it might have something to do with it, and never reset it. The question/problem here is: why does it stop crawling/scraping after finishing page 9?

Oh, and maybe that's why I never reset it: WebDriverWait(ff, 15).until(EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→')).click() returns a bool object, which isn't clickable. There's an error in your code (see the sketch after these comments).

@Anthony As per your observation I made some minor modifications to the solution. I could have optimized your code to perform the same web scraping, opening the Firefox browser client only once and traversing the various products, but to keep your logic and innovation intact I suggested the minimal changes to get you through. Could you try the updated solution and let me know the status?

Actually, I solved this on my own a few days ago. I've updated my question for clarity, and then updated it again. And yes, there is a specific reason I create multiple instances rather than traversing the site with a single instance, although now that I think about it, with the ff.back() function creating new instances might not even be necessary; it would certainly still be much simpler. I'll read your answer when I have some free time! Thanks for trying to solve my problem.
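To illustrate the point raised in the second comment, a minimal sketch (mine; ff is assumed to be a live WebDriver instance):

  from selenium.webdriver.common.by import By
  from selenium.webdriver.support import expected_conditions as EC
  from selenium.webdriver.support.ui import WebDriverWait

  # text_to_be_present_in_element resolves to True/False, so calling .click()
  # on the result raises AttributeError: 'bool' object has no attribute 'click'
  WebDriverWait(ff, 15).until(
      EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→')
  )

  # element_to_be_clickable resolves to the WebElement itself, so the return
  # value of until() is safe to click
  next_link = WebDriverWait(ff, 15).until(
      EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT, 'Next→'))
  )
  next_link.click()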

I tweaked the code a bit and it seems to work now. The changes:

Added the import random statement, since random is used and the script won't run without it.

product_title 循环内,这些行被删除:

ff.quit(), refresh_page(url) and break

The ff.quit() statement would cause a fatal (connection) error that made the script break.

Also changed is to ==: if count + 1 == len(items): (see the short illustration below).
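A quick sketch, with arbitrary values, of why is only appeared to work:

a, b = 32, int('32')
print(a is b)  # True in CPython, but only because small ints (-5..256) are cached
c, d = 1000, int('1000')
print(c is d)  # typically False: two distinct objects holding the same value
print(c == d)  # True: value equality, which is what the loop check needs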

# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import time
import random



""" Set Global Variables
"""
ua_strings = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36']
already_scraped_product_titles = []



""" Create Instances of WebDriver
"""
def create_webdriver_instance():
    ua_string = random.choice(ua_strings)
    profile = webdriver.FirefoxProfile()
    profile.set_preference('general.useragent.override', ua_string)
    options = Options()
    options.add_argument('--headless')
    # pass options so the '--headless' argument actually takes effect
    return webdriver.Firefox(firefox_profile=profile, options=options)

""" Construct List of UA Strings
"""
def fetch_ua_strings():
    ff = create_webdriver_instance()
    ff.get('https://techblog.willshouse.com/2012/01/03/most-common-user-agents/')
    ua_strings_ff_eles = ff.find_elements_by_xpath('//td[@class="useragent"]')
    for ua_string in ua_strings_ff_eles:
        if 'mobile' not in ua_string.text and 'Trident' not in ua_string.text:
            ua_strings.append(ua_string.text)
    ff.quit()

""" Build Lists of Product Page URLs
"""
def initiate_crawl():
    def refresh_page(url):
        ff = create_webdriver_instance()
        ff.get(url)
        ff.find_element(By.XPATH, '//*[@id="FilterItemView_sortOrder_dropdown"]/div/span[2]/span/span/span/span').click()
        ff.find_element(By.XPATH, '//a[contains(text(), "Discount - High to Low")]').click()
        items = WebDriverWait(ff, 15).until(
            EC.visibility_of_all_elements_located((By.XPATH, '//div[contains(@id, "100_dealView_")]'))
        )
        print(items)
        for count, item in enumerate(items):
            slashed_price = item.find_elements(By.XPATH, './/span[contains(@class, "a-text-strike")]')
            active_deals = item.find_elements(By.XPATH, './/*[contains(text(), "Add to Cart")]')
            # For Groups of Items on Sale
            # active_deals = //*[contains(text(), "Add to Cart") or contains(text(), "View Deal")]
            if len(slashed_price) > 0 and len(active_deals) > 0:
                product_title = item.find_element(By.ID, 'dealTitle').text
                if product_title not in already_scraped_product_titles:
                    already_scraped_product_titles.append(product_title)
                    url = ff.current_url
                    # Scrape Details of Each Deal
                    #extract(ff, item.find_element(By.ID, 'dealImage').get_attribute('href'))
                    print(product_title[:10])
                    # This ff.quit()-line breaks connection which breaks things.:
                    #ff.quit()
                    # And why 
                    #refresh_page(url)
                    #break
            # 'is' tests for object equality; == tests for value equality:
            if count+1 == len(items):
                try:
                    print('')
                    print('new page')
                    next_button = WebDriverWait(ff, 15).until(
                        EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→')
                    )
                    ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()                    
                    time.sleep(3)
                    url = ff.current_url
                    print(url)
                    print('')
                    ff.quit()
                    refresh_page(url)
                except Exception as error:
                    """
                    ff.find_element(By.XPATH, '//*[@id="pagination-both-004143081429407891"]/ul/li[9]/a').click()
                    url = ff.current_url
                    ff.quit()
                    refresh_page(url)
                    """
                    print('cannot find ff.find_element(By.PARTIAL_LINK_TEXT, "Next→")')
                    print('Because of... {}'.format(error))
                    ff.quit()

    refresh_page('https://www.amazon.ca/gp/goldbox/ref=gbps_ftr_s-3_4bc8_dct_10-?gb_f_c2xvdC0z=sortOrder:BY_SCORE,discountRanges:10-25%252C25-50%252C50-70%252C70-&pf_rd_p=f5836aee-0969-4c39-9720-4f0cacf64bc8&pf_rd_s=slot-3&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A3DWYIK6Y9EEQB&pf_rd_r=CQ7KBNXT36G95190QJB1&ie=UTF8')

#def extract_info(ff, url):
fetch_ua_strings()
initiate_crawl()
