Selenium/python: extract text from a dynamically-loading webpage after every scroll

Posted 2023-02-23

【Published】: 2018-01-27 18:04:31

【Question】:

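I am using Selenium/python to automatically scroll down a social media website and scrape posts. At the moment I extract all the text in one "hit" after scrolling a certain number of times (code below), but I would like to extract only the newly loaded text after each scroll.

For example, if the page initially contains the text "A, B, C" and "D, E, F" appears after the first scroll, I want to store "A, B, C", then scroll, then store "D, E, F", and so on.

The specific items I want to extract are the date and message text of each post, which can be obtained with the CSS selectors '.message-date' and '.message-body' respectively (e.g., dates = driver.find_elements_by_css_selector('.message-date')).

Can anyone suggest how to extract only the newly loaded text after each scroll?

Here is my current code (it extracts all the dates/messages after I have finished scrolling):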

from selenium import webdriver
import sys
import time
from selenium.webdriver.common.keys import Keys

#load website to scrape
driver = webdriver.PhantomJS()
driver.get("https://stocktwits.com/symbol/USDJPY?q=%24USDjpy")

#Scroll the webpage
ScrollNumber=3 #max scrolls
print(str(ScrollNumber)+ " scrolldown will be done.")
for i in range(ScrollNumber):  #scroll down ScrollNumber times (range(1, ScrollNumber) stops one scroll short)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3) #Delay between 2 scrolls down to be sure the page loaded
    ## I WANT TO SAVE/STORE ANY NEWLY LOADED POSTS HERE RATHER 
    ## THAN EXTRACTING IT ALL IN ONE GO AT THE END OF THE LOOP

# Extract messages and dates.
## I WANT TO EXTRACT THIS DATA ON THE FLY IN THE ABOVE
## LOOP RATHER THAN EXTRACTING IT HERE
dates = driver.find_elements_by_css_selector('.message-date')
messages = driver.find_elements_by_css_selector('.message-body')

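【Comments】:

Why go to so much trouble when the page uses an API in the background? https://stocktwits.com/streams/poll?stream=symbol&max=92297722&stream_id=674&substream=top&item_id=674

Thanks for the info, but would you mind elaborating on how to use the API to download the data? That URL doesn't work for me. As for why I didn't use the API in the first place: a) I don't know how to do that (I am new to APIs) and b) I have heard the API can only be used to download the past 30 days of data, whereas I need at least a year of posts.

You need to learn how to use Chrome Developer Tools or Firebug to find the network requests. You can retrieve the page first and then work out the API URL from the page. The point is that if scrolling the page adds new data, it always adds it by calling an API that returns the data as JSON, HTML, or some other format. So if you can get a year of data by scrolling manually, you can also get it through the API. That is because even the web page has to fetch the data in order to show it to you.

Thanks, I'll look into it.

【Answer 1】: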

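This does exactly what you want. However, I wouldn't scrape the site this way: the longer it runs, the slower it gets, and RAM usage will also get out of hand.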

import time
from hashlib import md5

import selenium.webdriver.support.ui as ui
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

URL = 'https://stocktwits.com/symbol/USDJPY?q=%24USDjpy'
CSS = By.CSS_SELECTOR

# The snippet as posted never creates the driver; Chrome is an assumption here
driver = webdriver.Chrome()
driver.get(URL)


def scrape_for(max_seconds=300):
    found = set()
    end_at = time.time() + max_seconds
    wait = ui.WebDriverWait(driver, 5, 0.5)

    while True:
        # find elements
        elms = driver.find_elements(CSS, 'li.messageli')

        for li in elms:
            # get the information we need about each post
            text = li.find_element(CSS, 'div.message-content')
            key = md5(text.text.encode('ascii', 'ignore')).hexdigest()

            if key in found:
                continue

            found.add(key)

            try:
                date = li.find_element(CSS, 'div.message-date').text
            except NoSuchElementException:
                date = None

            yield text.text, date

        if time.time() > end_at:
            # A plain return ends the generator; raising StopIteration inside
            # a generator is a RuntimeError since Python 3.7 (PEP 479)
            return

        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        wait.until(EC.invisibility_of_element_located(
                       (CSS, 'div#more-button-loading')))


for twit in scrape_for(60):
    print(twit)

driver.quit()

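【Comments】:

【Answer 2】:

As others have said, if you can do what you need by hitting the API directly, that is your best option. If you absolutely must use Selenium, see my solution below.

I did something similar to the following for my own needs.

I am leveraging the :nth-child() aspect of CSS paths to find elements individually as they load. I am also using selenium's explicit wait functionality (via the explicit package, pip install explicit) to wait efficiently for elements to load.

The script exits quickly (no calls to sleep()); however, the web page itself has so much junk going on in the background that selenium often takes a while to return control to the script.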

from __future__ import print_function

from itertools import count
import sys
import time

from explicit import waiter, CSS
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.wait import WebDriverWait as Wait


# The CSS selectors we will use
POSTS_BASE_CSS = 'ol.stream-list > li'              # All li elements
POST_BASE_CSS = POSTS_BASE_CSS + ":nth-child({0})"  # li child element at index {0}
POST_DATE_CSS = POST_BASE_CSS + ' div.message-date'     # li child element at {0} with div.message-date
POST_BODY_CSS = POST_BASE_CSS + ' div.message-body'     # li child element at {0} with div.message-body



class Post(object):
    def __init__(self, driver, post_index):
        self.driver = driver
        self.date_css = POST_DATE_CSS.format(post_index)
        self.text_css = POST_BODY_CSS.format(post_index)

    @property
    def date(self):
        return waiter.find_element(self.driver, self.date_css, CSS).text

    @property
    def text(self):
        return waiter.find_element(self.driver, self.text_css, CSS).text


def get_posts(driver, url, max_screen_scrolls):
    """ Post object generator """
    driver.get(url)
    screen_scroll_count = 0

    # Wait for the initial posts to load:
    waiter.find_elements(driver, POSTS_BASE_CSS, CSS)

    for index in count(1):
        # Evaluate if we need to scroll the screen, or exit the generator
        # If there is no element at this index, it means we need to scroll the screen
        if len(driver.find_elements_by_css_selector('ol.stream-list > :nth-child({0})'.format(index))) == 0:
            if screen_scroll_count >= max_screen_scrolls:
                # Break if we have already done the max scrolls
                break

            # Get count of total posts on page
            post_count = len(waiter.find_elements(driver, POSTS_BASE_CSS, CSS))

            # Scroll down
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            screen_scroll_count += 1

            def posts_load(driver):
                """ Custom explicit wait function; waits for more posts to load in """
                return len(waiter.find_elements(driver, POSTS_BASE_CSS, CSS)) > post_count

            # Wait until new posts load in
            Wait(driver, 20).until(posts_load)

        # The list elements have sponsored ads and scripts mixed in with the posts we
        # want to scrape. Check if they have a div.message-date element and continue on
        # if not
        includes_date_css = POST_DATE_CSS.format(index)
        if len(driver.find_elements_by_css_selector(includes_date_css)) == 0:
            continue

        yield Post(driver, index)


def main():
    url = "https://stocktwits.com/symbol/USDJPY?q=%24USDjpy"
    max_screen_scrolls = 4
    driver = webdriver.Chrome()
    try:
        for post_num, post in enumerate(get_posts(driver, url, max_screen_scrolls), 1):
            print("*" * 40)
            print("Post #{0}".format(post_num))
            print("\nDate: {0}".format(post.date))
            print("Text: {0}\n".format(post.text[:34]))

    finally:
        driver.quit()  # Use try/finally to make sure the driver is closed


if __name__ == "__main__":
    main()

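Full disclosure: I am the creator of the explicit package. You could easily rewrite the above using explicit waits directly, at the expense of readability.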
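For comparison, here is a rough sketch of what the scroll-and-wait step could look like with Selenium's own WebDriverWait instead of the explicit package. It is not the author's code: the helper names more_matches_than and scroll_for_more are made up for illustration, the selector is the one used in the script above, and the string "css selector" is used in place of By.CSS_SELECTOR (they are the same value).

```python
def more_matches_than(css, count):
    """Build a wait condition that passes once more than `count` nodes match `css`."""
    def condition(driver):
        # "css selector" is the string value of selenium's By.CSS_SELECTOR
        found = driver.find_elements("css selector", css)
        # WebDriverWait keeps polling while the condition returns a falsy value
        return found if len(found) > count else False
    return condition


def scroll_for_more(driver, css="ol.stream-list > li", timeout=20):
    """Scroll to the bottom, wait for extra nodes, and return only the new ones."""
    # Imported here so more_matches_than() stays usable without selenium installed
    from selenium.webdriver.support.ui import WebDriverWait

    before = len(driver.find_elements("css selector", css))
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # until() returns the condition's truthy value, or raises TimeoutException
    found = WebDriverWait(driver, timeout).until(more_matches_than(css, before))
    return found[before:]
```

Calling scroll_for_more(driver) once per scroll returns just the list items that appeared since the previous call, assuming the site only appends to the end of the list.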

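【Comments】:

Thanks, this is really helpful. I awarded the bounty to the other user because he answered the question first, but your code works well too.

【Answer 3】: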

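I had the same problem with Facebook posts. To solve it, I save the post ID (or any value unique to the post, even a hash) in a list; when you run the query again, you check whether that ID is already in your list.

Also, you can remove the DOM nodes you have already parsed, so that only the new ones remain.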
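A minimal sketch of both ideas, under a few assumptions: the function names (post_key, take_new, drop_parsed) are hypothetical, a set is used instead of a list so membership checks stay fast, and the 'li.messageli' selector is borrowed from Answer 1. Whether the site keeps loading posts after its nodes are removed is untested here.

```python
import hashlib


def post_key(date, body):
    """Fingerprint a post from its date and text (used when no real ID is available)."""
    return hashlib.sha256("{0}\x1f{1}".format(date, body).encode("utf-8")).hexdigest()


def take_new(posts, seen):
    """Return the (date, body) pairs not seen before and remember them in `seen`."""
    fresh = []
    for date, body in posts:
        key = post_key(date, body)
        if key not in seen:
            seen.add(key)
            fresh.append((date, body))
    return fresh


# JavaScript that deletes every node matching the selector passed as arguments[0]
REMOVE_PARSED_JS = (
    "document.querySelectorAll(arguments[0])"
    ".forEach(function (node) { node.remove(); });"
)


def drop_parsed(driver, css="li.messageli"):
    """Remove already-parsed posts from the page so the DOM does not keep growing."""
    driver.execute_script(REMOVE_PARSED_JS, css)
```

After each scroll you would read the (date, body) text of every post on the page, pass the list through take_new(), store the result, and optionally call drop_parsed(driver).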

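【Comments】:

【Answer 4】:

This isn't really what you asked for (it is not a selenium solution); it is a solution that uses the API the page itself calls in the background. In my opinion, using selenium instead of the API is overkill.

Here is a script that uses the API: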

import re

import requests

STREAM_ID_REGEX = re.compile(r"data-stream='symbol-(?P<id>\d+)'")
CSRF_TOKEN_REGEX = re.compile(r'<meta name="csrf-token" content="(?P<csrf>[^"]+)" />')

URL = 'https://stocktwits.com/symbol/USDJPY?q=%24USDjpy'
API_URL = 'https://stocktwits.com/streams/poll?stream=symbol&substream=top&stream_id={stream_id}&item_id={stream_id}'


def get_necessary_info():
    res = requests.get(URL)

    # Extract stream_id
    match = STREAM_ID_REGEX.search(res.text)
    stream_id = match.group('id')

    # Extract CSRF token
    match = CSRF_TOKEN_REGEX.search(res.text)
    csrf_token = match.group('csrf')

    return stream_id, csrf_token


def get_messages(stream_id, csrf_token, max_messages=100):
    base_url = API_URL.format(stream_id=stream_id)

    # Required headers
    headers = {
        'x-csrf-token': csrf_token,
        'x-requested-with': 'XMLHttpRequest',
    }

    messages = []
    more = True
    max_value = None
    while more:
        # Pagination
        if max_value:
            url = '{}&max={}'.format(base_url, max_value)
        else:
            url = base_url

        # Get JSON response
        res = requests.get(url, headers=headers)
        data = res.json()

        # Add returned messages
        messages.extend(data['messages'])

        # Check if there are more messages
        more = data['more']
        if more:
            max_value = data['max']

        # Check if we have enough messages
        if len(messages) >= max_messages:
            break

    return messages


def main():
    stream_id, csrf_token = get_necessary_info()
    messages = get_messages(stream_id, csrf_token)

    for message in messages:
        print(message['created_at'], message['body'])


if __name__ == '__main__':
    main()

And the first lines of the output:

Tue, 03 Oct 2017 03:54:29 -0000 $USDJPY (113.170) Daily Signal remains LONG from 109.600 on 12/09. SAR point now at 112.430. website for details
Tue, 03 Oct 2017 03:33:02 -0000 JPY: Selling JPY Via Long $USDJPY Or Long $CADJPY Still Attractive  - SocGen https://www.efxnews.com/story/37129/jpy-selling-jpy-long-usdjpy-or-long-cadjpy-still-attractive-socgen#.WdMEqnCGMCc.twitter
Tue, 03 Oct 2017 01:05:06 -0000 $USDJPY buy signal on 03 OCT 2017 01:00 AM UTC by AdMACD Trading System (Timeframe=H1) http://www.cosmos4u.net/index.php/forex/usdjpy/usdjpy-buy-signal-03-oct-2017-01-00-am-utc-by-admacd-trading-system-timeframe-h1 #USDJPY #Forex
Tue, 03 Oct 2017 00:48:46 -0000 $EURUSD nice H&S to take price lower on $USDJPY just waiting 4 inner trendline to break. built up a lot of strength
Tue, 03 Oct 2017 00:17:13 -0000 $USDJPY The Instrument reached the 100% from lows at 113.25 and sold in 3 waves, still can see 114.14 area.#elliottwave $USDX
...
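Since the question mentions needing about a year of posts, the pagination loop above could also stop at a cutoff date rather than at a message count. The sketch below is only an illustration: iter_messages, parse_created_at and the injected fetch_page callable are hypothetical names, the response shape ('messages', 'more', 'max', 'created_at') is taken from the script above, and it assumes the API returns messages newest first, as the sample output suggests.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_created_at(value):
    """Parse a date such as 'Tue, 03 Oct 2017 03:54:29 -0000' into an aware datetime."""
    parsed = parsedate_to_datetime(value)
    if parsed.tzinfo is None:
        # A "-0000" offset yields a naive datetime; treat it as UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def iter_messages(fetch_page, stop_before: datetime):
    """Yield messages page by page until one is older than `stop_before`.

    fetch_page(max_value) must return one decoded JSON page as a dict;
    max_value is None for the first page.
    """
    max_value = None
    while True:
        data = fetch_page(max_value)
        for message in data["messages"]:
            if parse_created_at(message["created_at"]) < stop_before:
                return  # everything after this point is older than the cutoff
            yield message
        if not data.get("more"):
            return
        max_value = data["max"]
```

Here fetch_page would wrap the requests.get(...).json() call from get_messages(), and stop_before would be a timezone-aware datetime such as datetime(2017, 1, 1, tzinfo=timezone.utc). Because it is a generator, messages can be written to disk as they arrive instead of being held in one list.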

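【Comments】:

@jruf003 Could you tell me what is wrong with my answer?

I assume you are referring to the downvote? Presumably it is because your post doesn't directly answer my question, but I am not the one who downvoted you. In fact, I upvoted you because you read between the lines and gave me a good solution (thanks).

【Answer 5】:

You can store the number of messages in a variable and use xpath with position() to get only the newly added posts: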

dates = []
messages = []
num_of_posts = 1
for i in range(1, ScrollNumber):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)
    dates.extend(driver.find_elements_by_xpath('(//div[@class="message-date"])[position()>=' + str(num_of_posts) + ']'))
    messages.extend(driver.find_elements_by_xpath('(//div[contains(@class, "message-body")])[position()>=' + str(num_of_posts) + ']'))
    num_of_posts = len(dates) + 1  # XPath positions are 1-based, so the next new post is one past the count
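
# A variant of the same idea, sketched as a generator that hands back one batch of
# text per scroll (the behaviour asked for in the question). This is an illustration
# under assumptions, not tested against the site: xpath_after and scroll_batches are
# hypothetical names, "xpath" is the string value of selenium's By.XPATH, and every
# post is assumed to have exactly one date and one body so that zip() pairs them.

```python
import time

XPATH = "xpath"  # same string value as selenium's By.XPATH


def xpath_after(class_name, taken):
    """XPath for every div with `class_name` beyond the first `taken` matches."""
    return '(//div[contains(@class, "{0}")])[position()>{1}]'.format(class_name, taken)


def scroll_batches(driver, rounds, pause=3.0):
    """Yield one list of (date, message) text pairs per round, then scroll."""
    taken = 0
    for _ in range(rounds):
        dates = driver.find_elements(XPATH, xpath_after("message-date", taken))
        bodies = driver.find_elements(XPATH, xpath_after("message-body", taken))
        # Read .text right away; stored WebElements can go stale once the page changes
        batch = [(d.text, b.text) for d, b in zip(dates, bodies)]
        taken += len(batch)
        yield batch
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # crude wait for the next posts to load
```

# Usage: for batch in scroll_batches(driver, 3): save(batch)
# where save() is whatever storage step you need; the first batch holds the posts
# that were on the page before any scrolling.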

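【Comments】:

【Answer 6】:

Just sleep after scrolling. Selenium drives the page the way a real machine would, so it has to wait until the page has loaded the new content. I suggest you use selenium's wait functionality, or simply add a sleep call to your code to give the content time to load.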

time.sleep(5)

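【Comments】:

I am already using time.sleep() in the example code I provided. Are you suggesting I could use this function in a different way to solve my specific problem (i.e., extracting only the newly loaded posts after each scroll, as in the example in the second paragraph of my original question)? If so, could you elaborate on how to achieve that?

Is it important to use python?

Dear friend, I understand the question. Let me look at that page with PHP and I will tell you the next day.

That would be great, thanks. I am only familiar with R and Python, but I will use any language that works.

Did you ever check this problem with PHP as discussed above, @kiamoz?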
