How do I get the entire <body> of a web page's source in Python?

Posted 2023-04-27

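Editor's note: This article was compiled by the editors of 小常识网 (cha138.com). It covers how to get the entire <body> of a web page's source in Python, and we hope it is a useful reference.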

How can I get everything inside the <body> element, from <body> to </body>?

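Answer A: The usual approach is to fetch the HTML with the requests library and then extract the content with an HTML parser such as BeautifulSoup (a regular expression also works for simple cases, but a parser is more robust). For example:
import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.gov.cn/", timeout=10)  # fetch the page
response.raise_for_status()  # stop here on an HTTP error status
soup = BeautifulSoup(response.text, "html.parser")  # parse the HTML
print(soup.body)  # the whole <body> element; soup.title and other tags work the same way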
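
The snippet above prints the whole <body> element, tags included. As a minimal sketch of the other forms you might want, the example below separates the outer HTML, the inner HTML, and the plain text of the body. The helper name body_parts and the sample markup are made up for illustration; in practice the HTML string would come from requests.get(...).text.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used here instead of a live request.
SAMPLE_HTML = """<html><head><title>Demo</title></head>
<body class="main"><h1>Hello</h1><p>World</p></body></html>"""


def body_parts(html):
    """Return (outer_html, inner_html, text) for the <body> of an HTML string.

    Returns None when the markup has no <body> tag, because html.parser
    does not add one that is missing.
    """
    soup = BeautifulSoup(html, "html.parser")
    body = soup.body
    if body is None:
        return None
    outer_html = str(body)                  # includes the <body ...> tags themselves
    inner_html = body.decode_contents()     # only what sits between the tags
    text = body.get_text(" ", strip=True)   # visible text with the tags removed
    return outer_html, inner_html, text


outer_html, inner_html, text = body_parts(SAMPLE_HTML)
print(outer_html)
print(inner_html)
print(text)
```

If the page builds its content with JavaScript, none of this will appear in the fetched HTML, since requests does not execute scripts.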

How to get a web page's source code with the WebBrowser control

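Most people get a page's source from the WebBrowser control like this:
(WebBrowser1.Document as IHTMLDocument2).body.outerHTML;
The drawback is that this returns only the markup between <body> and </body>; anything outside <body>, such as the <head> section, is missing. The approach below, shared by an experienced developer, returns the whole document and is offered for reference: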
procedure TForm1.Button1Click(Sender: TObject);
var
  ole_index, oleObj: OleVariant;
  i: Integer;
begin
  if WebBrowser1.Busy then Exit; // page is still loading, so bail out
  Memo1.Lines.Clear;
  // URL and full source of the main document
  Memo1.Lines.Add(WebBrowser1.OleObject.document.url);
  Memo1.Lines.Add(WebBrowser1.OleObject.document.documentElement.outerHTML);
  Memo1.Lines.Add(' '); Memo1.Lines.Add(' '); // add blank separator lines
  // URL and full source of every child frame
  for i := 0 to WebBrowser1.OleObject.document.frames.length - 1 do
  begin
    ole_index := i;
    oleObj := WebBrowser1.OleObject.document.frames.item(ole_index);
    Memo1.Lines.Add(oleObj.document.url);
    Memo1.Lines.Add(oleObj.document.documentElement.outerHTML);
    Memo1.Lines.Add(' '); Memo1.Lines.Add(' '); // add blank separator lines
  end;
end;

Answer A: In MFC, CHtmlView has a GetSource method that retrieves the page's HTML source.
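
For readers following the Python question at the top of this page: requests returns only the top-level document, so frame contents have to be located and fetched separately, much like the loop over frames in the Delphi answer. The sketch below only collects the frame URLs, using the standard library's html.parser so it runs without extra packages. The class FrameCollector, the helper list_frame_urls, and the sample markup are made-up names for illustration, not part of any answer above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameCollector(HTMLParser):
    """Collect absolute URLs from the src of <frame> and <iframe> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.frame_urls = []

    def handle_starttag(self, tag, attrs):
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                # Resolve relative links against the URL of the containing page.
                self.frame_urls.append(urljoin(self.base_url, src))


def list_frame_urls(html, base_url):
    """Return the frame URLs found in an HTML string, in document order."""
    collector = FrameCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.frame_urls


# Hypothetical sample page with two usable frames and one without a src.
SAMPLE_HTML = """<html><body>
<iframe src="/ads/banner.html"></iframe>
<iframe src="https://example.org/widget"></iframe>
<iframe></iframe>
</body></html>"""

print(list_frame_urls(SAMPLE_HTML, "https://example.com/index.html"))
```

Each URL in the returned list could then be fetched and parsed on its own, in the same way as the main page.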
