How to get the page source of the next page

【Title】: How to get Page Source of next page 【Posted】: 2019-05-25 14:53:30 【Question】:

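I am trying to turn the page held by the driver into HTML so I can use Beautiful Soup on it. The problem is that what prettify prints (that is, what the driver holds) is the HTML of the login page, not the page that follows it. I am sure the login succeeds and that the browser navigates to the next page.

Is there any reason the driver would still hold the first page's source instead of updating to the page we navigated to?

Here is my code: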

import os
import random
import sys

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'https://www.instagram.com/gelsonfonteles/followers/'
driver = webdriver.Chrome()
driver.implicitly_wait(1)
driver.get(url)


username = driver.find_element_by_xpath('//*[@name="username"]')
password = driver.find_element_by_xpath('//*[@name="password"]')
login_btn = driver.find_element_by_xpath('//*[@class="_0mzm- sqdOP  L3NKy      "]')

username.send_keys("name")
password.send_keys("pass")

#login
login_btn.click()
driver.implicitly_wait(2)

soup = BeautifulSoup(driver.page_source,features="lxml")
print(soup.prettify())

driver.quit()

【Answer 1】:

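You were close. You just need to induce WebDriverWait for the visibility of an element on the page you expect after login, and you can use features="html.parser", as follows:

Code block: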

# -*- coding: UTF-8 -*-
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'https://www.instagram.com/gelsonfonteles/followers/'
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get(url)
# Wait for the login form to be usable, then fill it in and submit it
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='username']"))).send_keys("username")
driver.find_element_by_css_selector("input[name='password']").send_keys("password")
driver.find_element_by_xpath("//button[normalize-space()='Log in']").click()
# Wait until the profile header is visible, i.e. the post-login page has rendered
WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//h1[text()='gelsonfonteles']")))
# Only now read page_source, so it reflects the page after login
soup = BeautifulSoup(driver.page_source,features="html.parser")
print(soup.prettify())
driver.quit()

Console output:

<!DOCTYPE html>
<html class="js logged-in client-root" lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <title>
   Gelson Fonteles ???? (@gelsonfonteles) • Instagram photos and videos
  </title>
  <meta content="noimageindex, noarchive" name="robots"/>
  <meta content="yes" name="mobile-web-app-capable"/>
  <meta content="#000000" name="theme-color"/>
  <meta content="width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=1, viewport-fit=cover" id="viewport" name="viewport"/>
  <link href="/data/manifest.json" rel="manifest"/>
  <link crossorigin="" href="https://graph.instagram.com" rel="preconnect"/>
  <link as="script" crossorigin="anonymous" href="/static/bundles/metro/ProfilePageContainer.js/68f09467caf1.js" rel="preload" type="text/javascript"/>
  <script async="" src="https://connect.facebook.net/signals/config/1425767024389221?v=2.8.35&amp;r=stable">
  </script>
  <script async="" src="//connect.facebook.net/en_US/fbevents.js">
  </script>
  <script id="facebook-jssdk" src="https://connect.facebook.net/en_US/sdk.js">
  </script>
  <script type="text/javascript">
   (function() 
  var docElement = document.documentElement;
  var classRE = new RegExp('(^|\\s)no-js(\\s|$)');
  var className = docElement.className;
  docElement.className = className.replace(classRE, '$1js$2');
)();
  </script>
  <script type="text/javascript">
   /*
 Copyright 2018 Google Inc. All Rights Reserved.
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
*/

(function()function g(a,c)b||(b=a,f=c,h.forEach(function(a)removeEventListener(a,l,e)),m())function m()b&amp;&amp;f&amp;&amp;0&lt;d.length&amp;&amp;(d.forEach(function(a)a(b,f)),d=[])function n(a,c)function k()g(a,c);d()function b()d()function d()removeEventListener("pointerup",k,e);removeEventListener("pointercancel",b,e)addEventListener("pointerup",k,e);addEventListener("pointercancel",b,e)function l(a)if(a.cancelable)var c=performance.now(),b=a.timeStamp;b&gt;c&amp;&amp;(c=+new Date);c-=b;"pointerdown"==a.type?n(c,
a):g(c,a)var e=passive:!0,capture:!0,h=["click","mousedown","keydown","touchstart","pointerdown"],b,f,d=[];h.forEach(function(a)addEventListener(a,l,e));window.perfMetrics=window.perfMetrics||;window.perfMetrics.onFirstInputDelay=function(a)d.push(a);m())();
  </script>
  <script type="text/javascript">
   (function() 
  if ('PerformanceObserver' in window &amp;&amp; 'PerformancePaintTiming' in window) 
    window.__bufferedPerformance = [];
    var ob = new PerformanceObserver(function(e) 
      window.__bufferedPerformance.push.apply(window.__bufferedPerformance,e.getEntries());
    );
    ob.observe(entryTypes:['paint']);
  
  window.__bufferedErrors = [];
  window.onerror = function(message, url, line, column, error) 
    window.__bufferedErrors.push(
      message: message,
      url: url,
      line: line,
      column: column,
      error: error
    );
    return false;
  ;
  window.__initialData = 
    pending: true,
    waiting: []
  ;
  function notifyLoaded(item, data) 
    item.pending = false;
    item.data = data;
    for (var i = 0;i &lt; item.waiting.length; ++i) 
      item.waiting[i].resolve(item.data);
    
    item.waiting = [];
  
  function notifyError(item, msg) 
    item.pending = false;
    item.error = new Error(msg);
    for (var i = 0;i &lt; item.waiting.length; ++i) 
      item.waiting[i].reject(item.error);
    
    item.waiting = [];
  
  window.__initialDataLoaded = function(initialData) 
    notifyLoaded(window.__initialData, initialData);
  ;
  window.__initialDataError = function(msg) 
    notifyError(window.__initialData, msg);
  ;
  window.__additionalData = ;
  window.__pendingAdditionalData = function(paths) 
    for (var i = 0;i &lt; paths.length; ++i) 
      window.__additionalData[paths[i]] = 
    pending: true,
    waiting: []
      ;
    
  ;
  window.__additionalDataLoaded = function(path, data) 
    if (path in window.__additionalData) 
      notifyLoaded(window.__additionalData[path], data);
     else 
      console.error('Unexpected additional data loaded "' + path + '"');
    
  ;
  window.__additionalDataError = function(path, msg) 
    if (path in window.__additionalData) 
      notifyError(window.__additionalData[path], msg);
     else 
      console.error('Unexpected additional data encountered an error "' + path + '": ' + msg);
    
  ;
)();
  </script>
  <link href="/static/images/ico/apple-touch-icon-76x76-precomposed.png/4272e394f5ad.png" rel="apple-touch-icon-precomposed" sizes="76x76"/>
  <link href="/static/images/ico/apple-touch-icon-120x120-precomposed.png/02ba5abf9861.png" rel="apple-touch-icon-precomposed" sizes="120x120"/>
  <link href="/static/images/ico/apple-touch-icon-152x152-precomposed.png/419a6f9c7454.png" rel="apple-touch-icon-precomposed" sizes="152x152"/>
  <link href="/static/images/ico/apple-touch-icon-167x167-precomposed.png/a24e58112f06.png" rel="apple-touch-icon-precomposed" sizes="167x167"/>
  <link href="/static/images/ico/apple-touch-icon-180x180-precomposed.png/85a358fb3b7d.png" rel="apple-touch-icon-precomposed" sizes="180x180"/>
  <link href="/static/images/ico/favicon-192.png/68d99ba29cc8.png" rel="icon" sizes="192x192"/>
  <link color="#262626" href="/static/images/ico/favicon.svg/fc72dd4bfde8.svg" rel="mask-icon"/>
  <link href="/static/images/ico/favicon.ico/36b3ee2d91ed.ico" rel="shortcut icon" type="image/x-icon"/>
  <link href="android-app://com.instagram.android/https/instagram.com/_u/gelsonfonteles/" rel="alternate"/>
  <meta content="Instagram" property="al:ios:app_name"/>
  <meta content="389801252" property="al:ios:app_store_id"/>
  <meta content="instagram://user?username=gelsonfonteles" property="al:ios:url"/>
  <meta content="Instagram" property="al:android:app_name"/>
  <meta content="com.instagram.android" property="al:android:package"/>
  <meta content="https://www.instagram.com/_u/gelsonfonteles/" property="al:android:url"/>
  <link href="https://www.instagram.com/gelsonfonteles/" rel="canonical"/>
  <meta content="94.2k Followers, 323 Following, 620 Posts - See Instagram photos and videos from Gelson Fonteles ???? (@gelsonfonteles)" name="description"/>
  <meta content="profile" property="og:type"/>
  <meta content="https://scontent-sin6-2.cdninstagram.com/vp/44c2bf3c9657d797afd661cd7026e189/5C9C5435/t51.2885-19/s150x150/46263173_2475614175787091_1415254353245110272_n.jpg?_nc_ht=scontent-sin6-2.cdninstagram.com" property="og:image"/>
  <meta content="Gelson Fonteles ???? (@gelsonfonteles) • Instagram photos and videos" property="og:title"/>
  <meta content="94.2k Followers, 323 Following, 620 Posts - See Instagram photos and videos from Gelson Fonteles ???? (@gelsonfonteles)" property="og:description"/>
  <meta content="https://www.instagram.com/gelsonfonteles/" property="og:url"/>
  <script type="application/ld+json">
   "@context":"http:\/\/schema.org","@type":"Person","name":"Gelson Fonteles \ud83d\udd8b\ud83d\udd04","alternateName":"@gelsonfonteles","description":"Fortaleza - CE , 23 anos!\nENCOMENDAS : Whats App: (85) 99760-7606","url":"http:\/\/www.facebook.com\/gelson.fonteles","mainEntityofPage":"@type":"ProfilePage","@id":"https:\/\/www.instagram.com\/gelsonfonteles\/","interactionStatistic":"@type":"InteractionCounter","interactionType":"http:\/\/schema.org\/FollowAction","userInteractionCount":"94237","image":"https:\/\/www.instagram.com\/static\/images\/ico\/favicon-200.png\/ab6eff595bb1.png","email":"gelsonfontelesart@gmail.com"
  </script>
  <link href="https://www.instagram.com/gelsonfonteles/" hreflang="x-default" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=en" hreflang="en" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=fr" hreflang="fr" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=it" hreflang="it" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=de" hreflang="de" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=es" hreflang="es" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=zh-cn" hreflang="zh-cn" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=zh-tw" hreflang="zh-tw" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=ja" hreflang="ja" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=ko" hreflang="ko" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=pt" hreflang="pt" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=pt-br" hreflang="pt-br" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=af" hreflang="af" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=cs" hreflang="cs" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=da" hreflang="da" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=el" hreflang="el" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=fi" hreflang="fi" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=hr" hreflang="hr" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=hu" hreflang="hu" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=id" hreflang="id" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=ms" hreflang="ms" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=nb" hreflang="nb" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=nl" hreflang="nl" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=pl" hreflang="pl" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=ru" hreflang="ru" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=sk" hreflang="sk" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=sv" hreflang="sv" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=th" hreflang="th" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=tl" hreflang="tl" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=tr" hreflang="tr" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=hi" hreflang="hi" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=bn" hreflang="bn" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=gu" hreflang="gu" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=kn" hreflang="kn" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=ml" hreflang="ml" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=mr" hreflang="mr" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=pa" hreflang="pa" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=ta" hreflang="ta" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=te" hreflang="te" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=ne" hreflang="ne" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=si" hreflang="si" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=ur" hreflang="ur" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=vi" hreflang="vi" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=bg" hreflang="bg" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=fr-ca" hreflang="fr-ca" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=ro" hreflang="ro" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=sr" hreflang="sr" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=uk" hreflang="uk" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=zh-hk" hreflang="zh-hk" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=es-la" hreflang="es-uy" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=es-la" hreflang="es-gt" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=es-la" hreflang="es-pe" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=es-la" hreflang="es-cl" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=es-la" hreflang="es-ar" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=es-la" hreflang="es-mx" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=es-la" hreflang="es-bo" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=es-la" hreflang="es-cu" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=es-la" hreflang="es-pa" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=es-la" hreflang="es-ve" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=es-la" hreflang="es-do" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=es-la" hreflang="es-co" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=es-la" hreflang="es-pr" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=es-la" hreflang="es-cr" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=es-la" hreflang="es-ec" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=es-la" hreflang="es-ni" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=es-la" hreflang="es-hn" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=es-la" hreflang="es-sv" rel="alternate"/>
  <link href="https://www.instagram.com/gelsonfonteles/?hl=es-la" hreflang="es-py" rel="alternate"/>

【Answer 2】:

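driver.implicitly_wait(2) does not help in this case: it only sets how long element lookups keep retrying and does not pause the script, so page_source can be read before the next page has loaded. You need an explicit wait instead. For example: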

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

login_btn.click()
WebDriverWait(driver, 10).until(EC.url_changes('https://www.instagram.com/accounts/login/?next=/gelsonfonteles/followers/')) #  pass exact URL of Login page
soup = BeautifulSoup(driver.page_source,features="lxml")

EC.url_changes waits until the current URL differs from the URL you pass in.

You can also wait for a specific element to appear on the target page.
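
To make the idea of an explicit wait concrete, here is a rough pure-Python sketch of the polling that such a wait performs. It is not Selenium's implementation: wait_until, url_changed_from, WaitTimeout, FakeDriver, and the example.test URLs are all hypothetical names made up for illustration, and FakeDriver merely pretends that navigation finishes on the third URL check.

```python
import time

LOGIN_URL = "https://example.test/login"  # hypothetical URLs, for illustration only
HOME_URL = "https://example.test/home"


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy within the timeout."""


def wait_until(driver, condition, timeout=10.0, poll=0.5):
    """Call condition(driver) repeatedly until it returns something truthy."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition(driver)
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} s")
        time.sleep(poll)


def url_changed_from(old_url):
    """Condition: truthy once the driver's current URL differs from old_url."""
    return lambda driver: driver.current_url != old_url


class FakeDriver:
    """Hypothetical stand-in for a browser; 'navigation' completes on the third URL read."""

    def __init__(self):
        self.reads = 0

    def _logged_in(self):
        return self.reads >= 3

    @property
    def current_url(self):
        self.reads += 1
        return HOME_URL if self._logged_in() else LOGIN_URL

    @property
    def page_source(self):
        return "<h1>profile</h1>" if self._logged_in() else "<form>login</form>"


driver = FakeDriver()
print(driver.page_source)  # read too early: still the login page
wait_until(driver, url_changed_from(LOGIN_URL), timeout=2, poll=0.01)
print(driver.page_source)  # read after the wait: the next page
```

The same loop works for "wait for an element": pass a condition that returns the element once it can be found and a falsy value otherwise. With real Selenium you would use WebDriverWait together with the expected_conditions helpers rather than writing this loop yourself.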
