python: Extracting text from a web page using BeautifulSoup and Python

Posted 2021-05-08

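Preface: This article was compiled by the editors of 小常识网 (cha138.com). It shows how to extract text from a web page using BeautifulSoup and Python, and we hope it serves as a useful reference.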

import requests
from bs4 import BeautifulSoup

url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
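res = requests.get(url, timeout=10)
res.raise_for_status()  # stop early on HTTP errors (4xx/5xx)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
# Collect every text node; `string=True` replaces the older `text=True` (bs4 >= 4.4.0)
text = soup.find_all(string=True)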

output = ''
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head', 
    'input',
    'script',
    # there may be more elements you don't want, such as "style", etc.
]

# Keep only text nodes whose parent tag is not in the blacklist
for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)

print(output)
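
The same filtering idea can be tried offline, without a network request. The sketch below is a minimal variant, not the original author's code: the function name `extract_visible_text`, the `sample_html` string, the added `'style'` entry, and the step that skips HTML comments and whitespace-only nodes are all assumptions introduced for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Same parent tags as the snippet above, plus 'style' (an assumed addition).
BLACKLIST = {
    '[document]', 'noscript', 'header', 'html',
    'meta', 'head', 'input', 'script', 'style',
}


def extract_visible_text(html, blacklist=BLACKLIST):
    """Return the text nodes of `html` whose parent tag is not blacklisted."""
    soup = BeautifulSoup(html, 'html.parser')
    chunks = []
    for node in soup.find_all(string=True):
        # HTML comments are text nodes too; skip them (assumed extra step).
        if isinstance(node, Comment):
            continue
        if node.parent.name in blacklist:
            continue
        stripped = node.strip()
        if stripped:  # drop whitespace-only nodes
            chunks.append(stripped)
    return ' '.join(chunks)


# Hypothetical sample page used instead of a live URL.
sample_html = """<html><head><style>p {color: red}</style></head>
<body><script>var x = 1;</script><p>Hello <b>world</b></p><!-- hidden --></body></html>"""

result = extract_visible_text(sample_html)
print(result)
```

Collecting the pieces in a list and joining them once avoids the trailing space that repeated string concatenation leaves behind. Note that tags not listed in the blacklist, such as `title`, still contribute their text.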
