Python: Extracting Text from a Web Page with BeautifulSoup


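import requests
from bs4 import BeautifulSoup

# Download the page and fail early on network or HTTP errors.
url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
res = requests.get(url, timeout=10)
res.raise_for_status()
html_page = res.content

# Parse the HTML and collect every text node in the document.
soup = BeautifulSoup(html_page, 'html.parser')
# string=True replaces the older text=True argument (Beautiful Soup 4.4.0+).
text = soup.find_all(string=True)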

output = ''
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head',
    'input',
    'script',
    # there may be more elements you don't want, such as "style", etc.
]

# Keep only text nodes whose direct parent tag is not in the blacklist.
for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)

print(output)
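
The same filtering idea can be tried offline by wrapping it in a function that takes an HTML string. This is only a minimal sketch: the helper name `visible_text`, the `SKIPPED_PARENTS` set, and the sample markup are hypothetical and not part of the original snippet. It also assumes you want `style` skipped and whitespace-only nodes dropped, which the script above does not do.

```python
from bs4 import BeautifulSoup

# Hypothetical set: the blacklist from the script above, plus 'style' (an assumption).
SKIPPED_PARENTS = {
    '[document]', 'noscript', 'header', 'html',
    'meta', 'head', 'input', 'script', 'style',
}


def visible_text(html):
    """Return the text of `html`, skipping nodes whose parent tag is in SKIPPED_PARENTS."""
    soup = BeautifulSoup(html, 'html.parser')
    pieces = []
    for node in soup.find_all(string=True):
        if node.parent.name in SKIPPED_PARENTS:
            continue
        stripped = node.strip()
        if stripped:  # drop whitespace-only text nodes
            pieces.append(stripped)
    return ' '.join(pieces)


# Made-up sample markup, used only for illustration.
sample = (
    '<html><head><script>var x = 1;</script>'
    '<style>p {color: red}</style></head>'
    '<body><p>Hello <b>world</b></p>'
    '<noscript>Enable JS</noscript></body></html>'
)
print(visible_text(sample))
```

Note that the check looks only at the direct parent of each text node, so text inside a tag nested under a skipped tag (for example a `<title>` inside `<head>`) is still returned unless that inner tag is added to the set as well.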
