Beautiful Soup 4 is not pulling down all the HTML on this web page

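So, I'm trying to learn how to do web scraping with Python, and to that end I'd like to figure out how to scrape all of the audio files from this website.

Here is my current code: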

from bs4 import BeautifulSoup
import requests

source = requests.get('https://www.nasa.gov/connect/sounds/index.html').text

soup = BeautifulSoup(source, 'lxml')

print(soup)

However, I don't think it is pulling down all of the HTML from the page, because this is the output I get:

<!DOCTYPE html>
<html class="no-js" dir="ltr" lang="en" prefix="content: http://purl.org/rss/1.0/modules/content/ dc: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/ og: http://ogp.me/ns# rdfs: http://www.w3.org/2000/01/rdf-schema# sioc: http://rdfs.org/sioc/ns# sioct: http://rdfs.org/sioc/types# skos: http://www.w3.org/2004/02/skos/core# xsd: http://www.w3.org/2001/XMLSchema#">
<head>
<meta content="IE=Edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta charset="utf-8"/>
<meta content="NASA" property="og:site_name"/>
<link href="http://www.w3.org/1999/xhtml/vocab" rel="profile"/>
<link href="/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>
<meta content="text/html" name="dc.format"/>
<meta content="Text" name="dc.type"/>
<meta content="und" name="dc.language"/>
<meta content="/connect/sounds/index.html" name="dc.identifier"/>
<meta content="2015-01-26T09:44-05:00" name="dc.date"/>
<meta content="Jim Wilson" name="dc.creator"/>
<meta content="Audio and Ringtones" name="dc.title"/>
<meta content="/connect/sounds/index.html" property="twitter:url"/>
<meta content="11348282" property="twitter:site:id"/>
<meta content="@NASA" property="twitter:site"/>
<meta content="article" property="og:type"/>
<link href="/connect/sounds/index.html" rel="shortlink"/>
<meta content="NASA.gov brings you the latest images, videos and news from America's space agency. Get the latest updates on NASA missions, watch NASA TV live, and learn about our quest to reveal the unknown and benefit all humankind." name="description"/>
<meta content="http://www.nasa.gov/sites/default/files/images/potw1335a_0.jpg" property="twitter:image1"/>
<meta content="NASA.gov brings you the latest images, videos and news from America's space agency. Get the latest updates on NASA missions, watch NASA TV live, and learn about our quest to reveal the unknown and benefit all humankind." property="og:description"/>
<meta content="http://www.nasa.gov/sites/default/files/files/nasa_insignia_300.jpg" property="og:image"/>
<meta content="gallery" property="twitter:card"/>
<meta content="NASA brings you images, videos and features from the unique perspective of America's space agency. Get updates on missions, watch NASA TV, read blogs, view the latest discoveries, and 
more." property="twitter:description"/>
<meta content="http://www.nasa.gov/sites/default/files/images/astro.jpg" property="twitter:image0"/>
<meta content="http://www.nasa.gov/sites/default/files/images/earth_1000.jpg" property="twitter:image2"/>
<link href="/connect/sounds/index.html" rel="canonical"/>
<meta content="http://www.nasa.gov/sites/default/files/images/Aeroplane.jpeg" property="twitter:image3"/>
<meta content="Audio and Ringtones" property="og:title"/>
<meta content="http://www.nasa.gov/connect/sounds/index.html" property="og:url"/>
<meta content="Audio and Ringtones" property="twitter:title"/>
<meta content="http://www.nasa.gov" property="twitter:image"/>
<meta content="Drupal 7 (http://drupal.org)" name="generator"/>
<script type="application/ld+json">{
    "@context": "http://schema.org",
    "@graph": [
        {
            "@type": "WebPage",
            "@id": "https://www.nasa.gov/connect/sounds/index.html",
            "name": "Audio and Ringtones",
            "description": "NASA.gov brings you the latest images, videos and news from America\u0027s space agency. Get the latest updates on NASA missions, watch NASA TV live, and learn about our quest to reveal the unknown and benefit all humankind.",
            "author": {
                "@type": "Organization",
                "@id": "https://www.nasa.gov/connect/sounds/index.html",
                "name": "NASA",
                "url": "https://www.nasa.gov",
                "sameAs": [
                    "https://twitter.com/nasa",
                    "https://www.facebook.com/nasa",
                    "https://instagram.com/nasa",
                    "https://plus.google.com/+NASA"
                ]
            },
            "publisher": {
                "@type": "Organization",
                "@id": "https://www.nasa.gov/connect/sounds/index.html",
                "name": "NASA",
                "url": "https://www.nasa.gov",
                "sameAs": "https://twitter.com/nasa,https://www.facebook.com/nasa,https://instagram.com/nasa,https://plus.google.com/+NASA",
                "logo": {
                    "@type": "ImageObject",
                    "url": "https://www.nasa.gov/sites/all/themes/custom/nasatwo/images/nasa-logo.svg",
                    "width": "110",
                    "height": "92"
                }
            }
        },
        {
            "@type": "WebSite",
            "@id": "www.nasa.gov",
            "name": "NASA",
            "url": "www.nasa.gov"
        }
    ]
}</script>
<meta content="width=device-width, initial-scale=1.0, maximum-scale=10.0" name="viewport"/>
<title>Audio and Ringtones | NASA</title>
<meta content="%7B%22modulePrefix%22%3A%22nasa%22%2C%22environment%22%3A%22development%22%2C%22baseURL%22%3A%22/%22%2C%22locationType%22%3A%22none%22%2C%22EmberENV%22%3A%7B%22FEATURES%22%3A%7B%7D%7D%2C%22APP%22%3A%7B%22LOG_ACTIVE_GENERATION%22%3Atrue%2C%22LOG_VIEW_LOOKUPS%22%3Atrue%7D%2C%22contentSecurityPolicyHeader%22%3A%22Content-Security-Policy-Report-Only%22%2C%22contentSecurityPolicy%22%3A%7B%22default-src%22%3A%22%27none%27%22%2C%22script-src%22%3A%22%27self%27%20%27unsafe-eval%27%22%2C%22font-src%22%3A%22%27self%27%22%2C%22connect-src%22%3A%22%27self%27%22%2C%22img-src%22%3A%22%27self%27%22%2C%22style-src%22%3A%22%27self%27%22%2C%22media-src%22%3A%22%27self%27%22%7D%2C%22exportApplicationGlobal%22%3Atrue%7D" name="nasa/config/environment"/>
<link href="/sites/all/themes/custom/nasatwo/images/apple-touch-icon.png" rel="apple-touch-icon"/>
<link href="/sites/all/themes/custom/nasatwo/images/apple-touch-icon-76x76.png" rel="apple-touch-icon" sizes="76x76"/>
<link href="/sites/all/themes/custom/nasatwo/images/apple-touch-icon-120x120.png" rel="apple-touch-icon" sizes="120x120"/>
<link href="/sites/all/themes/custom/nasatwo/images/apple-touch-icon-152x152.png" rel="apple-touch-icon" sizes="152x152"/>
<style>
@import url("/sites/all/modules/custom/scald_before_after_image/scald_before_after_image.css?");
@import url("/sites/all/modules/custom/scald_htmlsnippet/scald_htmlsnippet.css?");
@import url("/sites/all/modules/custom/scald_iframe/scald_iframe.css?");
</style>
<link href="/sites/all/themes/custom/nasatwo/css/vendor.css?" media="all" rel="stylesheet" type="text/css"/>
<link href="/sites/all/themes/custom/nasatwo/css/nasa.css?" media="all" rel="stylesheet" type="text/css"/>
<script id="_fed_an_ua_tag" language="javascript" src="https://dap.digitalgov.gov/Universal-Federated-Analytics-Min.js?agency=NASA&amp;yt=true&amp;dclink=true"></script>
<script type="text/javascript">
    // DO NOT MODIFY BELOW THIS LINE *****************************************
    ;(function (g) {
      var d = document, am = d.createElement('script'), h = d.head || d.getElementsByTagName("head")[0], fsr = 'fsReady',
        aex = {
          "src": "//gateway.answerscloud.com/nasa-gov/production/gateway.min.js",
          "type": "text/javascript",
          "async": "true",
          "data-vendor": "fs",
          "data-role": "gateway"
        };
      for (var attr in aex){am.setAttribute(attr, aex[attr]);}h.appendChild(am);g[fsr] = function () {var aT = '__' + fsr + '_stk__';g[aT] = g[aT] || [];g[aT].push(arguments);};
    })(window);
    // DO NOT MODIFY ABOVE THIS LINE *****************************************
    </script>
<script>window.landingPageID = 336285</script>
<script>window.Drupal = {behaviors: {}};</script>
<script src="/sites/all/themes/custom/nasatwo/js/vendor.js?"></script>
<script src="/sites/all/themes/custom/nasatwo/js/nasa.js?"></script>
</head>
<body class="html not-front not-logged-in page-node page-node- page-node-336285 node-type-landing-page-2015 section-connect">
<div class="l-page ember-init-hide">
<header class="l-header container-fluid" role="banner"></header>
<div class="l-main">
<div class="l-content container-fluid" id="main" role="main">
<script>
window.forcedRoute = "landingPage";
window.cardFeed = [];
</script>
</div>
</div>
<footer class="l-footer container-fluid" role="contentinfo">
<script async="async" src="//script.crazyegg.com/pages/scripts/0070/1109.js"></script>
</footer>
</div>
<script>
      /**
       * © 2011-2014 iPerceptions, Inc. All rights reserved. Do not distribute.
       * iPerceptions provides this code 'as is' without warranty of any kind,
       * either express or implied.
       */

     window.iperceptionskey = 'CTS00001';
     (function () {
       var a = document.createElement('script'),
           b = document.getElementsByTagName('body')[0];
       a.type = 'text/javascript';
       a.async = true;
       a.src = '//universal.iperceptions.com/wrapper.js';b.appendChild(a);
     })();
    </script>
</body>
</html>

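So, as you can see, the hyperlinks to the audio download files don't appear at all. And if you go to the web page and inspect it, you can see that the script did not pull everything down. Any ideas what might be going on? Thanks for your help.

Answer

As others have mentioned in the comments, the page is rendered dynamically. But hey, if you're not after reliability (as in "I just want to grab the stuff right now, and I don't care much if my script breaks soon"), you can take a look at the network traffic...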
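
One way to confirm that the links really are missing from the static HTML, rather than being dropped by the parser, is to count the mp3 links in what the server returned. The sketch below uses only the standard library's html.parser so it runs without extra dependencies; `LinkCollector` and `find_mp3_links` are hypothetical helper names, and `static_html` is a trimmed-down stand-in for the output shown in the question.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_mp3_links(html_text):
    """Return all <a href> values that end in .mp3."""
    collector = LinkCollector()
    collector.feed(html_text)
    return [h for h in collector.hrefs if h.lower().endswith(".mp3")]


# Stand-in for the static HTML from the question: the main container
# holds only a <script>, so there are no links to find yet.
static_html = """
<div class="l-content container-fluid" id="main" role="main">
<script>
window.forcedRoute = "landingPage";
window.cardFeed = [];
</script>
</div>
"""

print(find_mp3_links(static_html))
```

An empty list here means the content is filled in later by JavaScript, so no choice of parser will make the links appear in this response.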

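After a quick look, you realize that the actual HTML body you're after seems to be wrapped in JSON, specifically at https://www.nasa.gov/api/1/record/node/336285.json

Knowing that, it is quite simple to pick it up and regex the mp3 links out of it in a quick and dirty way: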
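
Note that the number in that URL, 336285, also shows up in the static HTML from the question, as `window.landingPageID = 336285`. The sketch below builds the JSON URL from that value. It assumes, without confirmation, that other pages on the site follow the same URL pattern; `API_TEMPLATE` and `api_url_from_static_html` are hypothetical names.

```python
import re

# Assumption: the URL pattern seen for this one page applies more generally.
API_TEMPLATE = "https://www.nasa.gov/api/1/record/node/{node_id}.json"


def api_url_from_static_html(html_text):
    """Build the JSON record URL from the landingPageID in the static HTML."""
    match = re.search(r"window\.landingPageID\s*=\s*(\d+)", html_text)
    if match is None:
        return None  # the page does not expose a landing page ID
    return API_TEMPLATE.format(node_id=match.group(1))


# The line below is copied from the static HTML in the question.
snippet = "<script>window.landingPageID = 336285</script>"
print(api_url_from_static_html(snippet))
# prints https://www.nasa.gov/api/1/record/node/336285.json
```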

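import re
import requests

# Fetch the JSON record that backs the page; the HTML body is embedded in it.
source = requests.get('https://www.nasa.gov/api/1/record/node/336285.json')
j = source.json()
body = j['landingPage']['body']

# Quick and dirty: pull every URL ending in .mp3 out of the embedded HTML.
for mp3 in re.findall(r"http.*?\.mp3", body):
    print(mp3)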
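
To try the same idea without hitting the network, here is a minimal offline sketch. The payload is made up: it only mimics the `landingPage` and `body` keys used above, and the example.com URLs are placeholders. The regex is a slightly stricter variant that stops at quotes, whitespace, and angle brackets; `extract_mp3_links` is a hypothetical helper name.

```python
import json
import re

# Hypothetical, heavily shortened payload; the real response is much larger.
sample_payload = json.dumps({
    "landingPage": {
        "body": '<p><a href="http://example.com/a.mp3">First clip</a></p>'
                '<p><a href="http://example.com/b.mp3">Second clip</a></p>'
    }
})


def extract_mp3_links(payload_text):
    """Return the mp3 URLs found in the HTML embedded in a JSON payload."""
    data = json.loads(payload_text)
    # Use .get() so a missing key yields an empty result instead of a KeyError.
    body = data.get("landingPage", {}).get("body", "")
    return re.findall(r"https?://[^\"'\s<>]+?\.mp3", body)


print(extract_mp3_links(sample_payload))
```

Falling back to an empty string keeps the script from crashing if the structure of the response ever changes, which fits the "may break soon" caveat of this approach.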

The code below is almost the same, but it will also download all of the mp3s:

import html
import re
import string
import requests

source = requests.get('https://www.nasa.gov/api/1/record/node/336285.json')
body = source.json()['landingPage']['body']

# Characters allowed in the generated file names.
valid_chars = "-_.() " + string.ascii_letters + string.digits

# Capture each mp3 URL together with the link text that follows it.
for link, raw_title in re.findall(r"(http.*?\.mp3).*?>(.*?)<", body):
    title = html.unescape(raw_title)
    filename = ''.join(c for c in title if c in valid_chars) + ".mp3"
    print("Downloading %s..." % filename)
    with open(filename, "wb") as target:
        target.write(requests.get(link).content)
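
The file-name cleanup in the download script can be pulled out into a small function and tried on its own. This is only a sketch: `title_to_filename` and its `fallback` parameter are hypothetical additions, and the sample title is made up for illustration. The fallback covers link text that contains no allowed characters at all, which would otherwise produce a file named just ".mp3".

```python
import html
import string

# Same whitelist as in the download script: a few punctuation marks,
# the space character, ASCII letters, and digits.
ALLOWED_CHARS = set("-_.() " + string.ascii_letters + string.digits)


def title_to_filename(title, fallback="untitled"):
    """Turn link text into a conservative .mp3 file name."""
    text = html.unescape(title)  # e.g. &quot; becomes "
    cleaned = "".join(c for c in text if c in ALLOWED_CHARS).strip()
    return (cleaned or fallback) + ".mp3"


print(title_to_filename("Apollo 11: &quot;The Eagle has landed&quot;"))
# prints Apollo 11 The Eagle has landed.mp3
```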
