Webkit_server (called from python's dryscrape) uses more and more memory with each page visited. How do I reduce the memory used?

Posted: 2015-08-25 18:45:49

Description:

I am writing a scraper in python3 using dryscrape. I'm trying to visit several hundred different urls during the scraping session, and to click through roughly 10 ajax pages on each url (without visiting a different url for each ajax page). I need something like dryscrape because I need to be able to interact with javascript components. The classes I wrote for this work, but I run out of memory after visiting roughly 50 or 100 pages (all 4GB of RAM is in use and the 4GB of swap space is almost 100% full). I looked into what was eating the memory, and the webkit_server process seems to be responsible for all of it. Why is this happening, and how can I avoid it?
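For reference, a minimal sketch of how the per-process memory can be checked with psutil (psutil itself is an assumption here; any process monitor such as top would show the same thing):

import psutil

# Print the resident memory of every process whose name contains
# "webkit_server" (the helper process dryscrape spawns).
for proc in psutil.process_iter():
    try:
        if 'webkit_server' in proc.name():
            rss_mb = proc.memory_info().rss / (1024.0 * 1024.0)
            print('pid %d uses %.1f MiB resident' % (proc.pid, rss_mb))
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass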

Below are the relevant snippets of my class and my main method.

This is the class that uses dryscrape; you can see exactly which settings I use.

import dryscrape
from lxml import html
from time import sleep
from webkit_server import InvalidResponseError
import re

from utils import unugly, my_strip, cleanhtml, stringify_children
from Profile import Profile, Question

class ExampleSession():
    
    def __init__(self, settings):
        self.settings = settings
        # dryscrape.start_xvfb()
        self.br = self.getBrowser()

    def getBrowser(self):
        session = dryscrape.Session()
        session.set_attribute('auto_load_images', False)
        session.set_header('User-agent', 'Google Chrome')
        return session
        
    def login(self):
        try:
            print('Trying to log in... ')
            self.br.visit('https://www.example.com/login')                        
            self.br.at_xpath('//*[@id="login_username"]').set(self.settings['myUsername'])
            self.br.at_xpath('//*[@id="login_password"]').set(self.settings['myPassword'])
            q = self.br.at_xpath('//*[@id="loginbox_form"]')
            q.submit()
        except Exception as e:
            print(str(e))
            print('\tException and couldn\'t log in!')
            return
        print('Logged in as %s' % (str(self.settings['myUsername']))) 
                
    def getProfileQuestionsByUrl(self, url, thread_id=0):
        self.br.visit(str(url.rstrip()) + '/questions')
        
        tree = html.fromstring(self.br.body())
        questions = []
        
        num_pages = int(my_strip(tree.xpath('//*[@id="questions_pages"]//*[@class="last"]')[0].text))
    
        page = 0
        while (page < num_pages):
            sleep(0.5)
            # Do something with each ajax page
            # Next try-except tries to click the 'next' button
            try:
                next_button = self.br.at_xpath('//*[@id="questions_pages"]//*[@class="next"]')
                next_button.click()
            except Exception as e:
                pass                
            page = page + 1

        return questions
    
    def getProfileByUrl(self, url, thread_id=0):
        missing = 'NA'

        try:
            try:
                # Visit a unique url
                self.br.visit(url.rstrip())
            except Exception as e:
                print(str(e))
                return None
            tree = html.fromstring(self.br.body())

            map = {}  # the dict literal was lost when the question was rendered; an empty dict is assumed here
            # Fill up the dictionary with some things I find on the page
            
            profile = Profile(map)    
            return profile
        except Exception as e:
            print(str(e))
            return None

And here is the main method (snippet):

from socket import error as SocketError  # assumed import: SocketError is caught below but not shown in the snippet

def getProfiles(settings, urls, thread_id):
    exampleSess = ExampleSession(settings)
    exampleSess.login()

    profiles = []
    '''
    I want to visit at most a thousand unique urls (but I don't care if it
    will take 2 hours or 2 days as long as the session doesn't fatally break
    and my laptop doesn't run out of memory)
    '''
    for url in urls:            
        try:
            profile = exampleSess.getProfileByUrl(url, thread_id)
    
            if (profile is not None):
                profiles.append(profile)
                
                try:
                    if (settings['scrapeQuestions'] == 'yes'):
                        profile_questions = exampleSess.getProfileQuestionsByUrl(url, thread_id)
                    
                        if (profile_questions is not None):
                            profile.add_questions(profile_questions)
                except SocketError as e:
                    print(str(e))
                    print('\t[Thread %d] SocketError in getProfileQuestionsByUrl of profile...' % (thread_id))
                        
        except Exception as e:
            print(str(e))
            print('\t[Thread %d] Exception while getting profile %s' % (thread_id, str(url.rstrip())))
            exampleSess.br.reset()  # was 'okc.br.reset()'; 'okc' appears to be a leftover name for exampleSess
    
    exampleSess = None # Does this kill my dryscrape session and prevent webkit_server from running?
    
    return profiles

Are my dryscrape settings correct? How do dryscrape and webkit_server end up using more than 4GB for the urls I visit with getProfileByUrl and getProfileQuestionsByUrl? Am I missing any setting that could be driving up the memory usage?


Answer 1:

I was not able to solve the memory problem (and I could reproduce it on a separate laptop). I ended up switching from dryscrape to selenium (and then to phantomjs). In my opinion PhantomJS is superior, and it doesn't use much memory either.
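For anyone making the same switch, below is a minimal sketch of the equivalent login flow with selenium driving PhantomJS. The selector ids are copied from the question; the PhantomJS driver and a phantomjs binary on the PATH are assumptions (the answer does not show its code), and newer selenium releases have dropped PhantomJS support:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.PhantomJS()  # assumes a selenium 2/3 release that still ships the PhantomJS driver
try:
    driver.get('https://www.example.com/login')
    driver.find_element(By.ID, 'login_username').send_keys('myUsername')
    driver.find_element(By.ID, 'login_password').send_keys('myPassword')
    driver.find_element(By.ID, 'loginbox_form').submit()

    # The page source can be fed to lxml.html.fromstring exactly as before.
    driver.get('https://www.example.com/some-profile/questions')
    page_html = driver.page_source
finally:
    driver.quit()  # terminates the phantomjs process and releases its memory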

Comments:

Anyone who wants to keep using dryscrape can refer to this question: ***.com/questions/36280450/…
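For completeness, a commonly suggested workaround for staying with dryscrape (not part of the answer above) is to reset the session every so often so webkit_server can release what it has accumulated. The sketch below assumes dryscrape's Session forwards reset() to the underlying webkit_server client; a reset also drops cookies, so the login would have to be repeated afterwards:

import dryscrape

session = dryscrape.Session()
session.set_attribute('auto_load_images', False)

urls = []  # fill with the profile urls from the question
for i, url in enumerate(urls):
    session.visit(url.rstrip())
    # ... scrape the page ...
    if i > 0 and i % 50 == 0:
        # Assumption: reset() clears the webkit_server state (and its memory),
        # but also the cookies, so log in again after this call.
        session.reset()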
