BeautifulSoup getText from between <p>, not picking up subsequent paragraphs

Posted: 2012-09-09 05:46:33

Question:

First of all, I'm a complete newbie when it comes to Python. However, I've written a piece of code that looks at an RSS feed, opens the links, and extracts the text from the articles. Here's what I have so far:

from BeautifulSoup import BeautifulSoup
import feedparser
import urllib
import re

# Dictionaries
links = {}
titles = {}

# Variables
n = 0

rss_url = "feed://www.gfsc.gg/_layouts/GFSC/GFSCRSSFeed.aspx?Division=ALL&Article=All&Title=News&Type=doc&List=%7b66fa9b18-776a-4e91-9f80-    30195001386c%7d%23%7b679e913e-6301-4bc4-9fd9-a788b926f565%7d%23%7b0e65f37f-1129-4c78-8f59-3db5f96409fd%7d%23%7bdd7c290d-5f17-43b7-b6fd-50089368e090%7d%23%7b4790a972-c55f-46a5-8020-396780eb8506%7d%23%7b6b67c085-7c25-458d-8a98-373e0ac71c52%7d%23%7be3b71b9c-30ce-47c0-8bfb-f3224e98b756%7d%23%7b25853d98-37d7-4ba2-83f9-78685f2070df%7d%23%7b14c41f90-c462-44cf-a773-878521aa007c%7d%23%7b7ceaf3bf-d501-4f60-a3e4-2af84d0e1528%7d%23%7baf17e955-96b7-49e9-ad8a-7ee0ac097f37%7d%23%7b3faca1d0-be40-445c-a577-c742c2d367a8%7d%23%7b6296a8d6-7cab-4609-b7f7-b6b7c3a264d6%7d%23%7b43e2b52d-e4f1-4628-84ad-0042d644deaf%7d"

# Parse the RSS feed
feed = feedparser.parse(rss_url)

# view the entire feed, one entry at a time
for post in feed.entries:
    # Create variables from posts
    link = post.link
    title = post.title
    # Add the link to the dictionary
    n += 1
    links[n] = link

for k,v in links.items():
    # Open RSS feed
    page = urllib.urlopen(v).read()
    page = str(page)
    soup = BeautifulSoup(page)

    # Find all of the text between paragraph tags and strip out the html
    page = soup.find('p').getText()

    # Strip ampersand codes and WATCH:
    page = re.sub('&\w+;','',page)
    page = re.sub('WATCH:','',page)

    # Print Page
    print(page)
    print(" ")

    # To stop after 3rd article, just whilst testing ** to be removed **
    if (k >= 3):
        break

This produces the following output:

>>> (executing lines 1 to 45 of "RSS_BeautifulSoup.py")
​Total deposits held with Guernsey banks at the end of June 2012 increased 2.1% in sterling terms by £2.1 billion from the end of March 2012 level of £101 billion, up to £103.1 billion. This is 9.4% lower than the same time a year ago.  Total assets and liabilities increased by £2.9 billion to £131.2 billion representing a 2.3% increase over the quarter though this was 5.7% lower than the level a year ago.  The higher figures reflected the effects both of volume and exchange rate factors.

The net asset value of total funds under management and administration has increased over the quarter ended 30 June 2012 by £711 million (0.3%) to reach £270.8 billion.For the year since 30 June 2011, total net asset values decreased by £3.6 billion (1.3%).

The Commission has updated the warranties on the Form REG, Form QIF and Form FTL to take into account the Commission’s Guidance Notes on Personal Questionnaires and Personal Declarations.  In particular, the following warranty (varies slightly dependent on the application) has been inserted in the aforementioned forms,

>>> 

The problem is that this is only the first paragraph of each article, but I need to show the whole article. Any help would be greatly appreciated.

Comments:

FYI, you can create the soup object directly with soup = BeautifulSoup(urllib.urlopen(v)). Also, if you're just learning BeautifulSoup, you're better off using bs4.

Answer 1:

You're getting close!

# Find all of the text between paragraph tags and strip out the html
page = soup.find('p').getText()

Using find (as you've seen) stops after it finds one result. If you want all the paragraphs, you need find_all. If the pages are consistently formatted (I've only looked at one), you could also use something like

soup.find('div', {'id': 'ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField'})

to zero in on the body of the article.
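
For example, here is a minimal sketch of how the loop body could change, replacing the page = soup.find('p').getText() line from the question (the div id is the one quoted above; whether every article page contains it is an assumption, hence the fallback):

# Sketch: gather every <p> instead of only the first one.
# bs4 spells these methods find_all/get_text; the legacy "from BeautifulSoup import BeautifulSoup"
# import used in the question spells them findAll/getText.
container = soup.find('div', {'id': 'ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField'})
if container is None:
    container = soup  # assumed fallback: use the whole page if the wrapper div is missing
paragraphs = container.find_all('p')
page = '\n\n'.join(p.get_text() for p in paragraphs)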

Comments:

Using soup.find('p').get_text() also works (and is the PEP 8-compliant spelling).

Answer 2:

This works for this particular article, where the text is all contained in <p> tags. Since the web is an ugly place, that isn't always the case.

Often a site's text is scattered around, wrapped in different kinds of tags (for example, it might sit in a <span>, <div>, or <li>).

To find all text nodes in the DOM, you can use soup.find_all(text=True).

This will return some unwanted text, such as the contents of <script> and <style> tags. You'll need to filter out the text content of elements you don't want:

blacklist = [
  'style',
  'script',
  # other elements,
]

text_elements = [t for t in soup.find_all(text=True) if t.parent.name not in blacklist]

If you're working with a known set of tags, you can take the opposite approach and whitelist them:

whitelist = [
  'p'
]

text_elements = [t for t in soup.find_all(text=True) if t.parent.name in whitelist]
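
Either way, what comes back is a list of text nodes rather than a single string, so the pieces still need to be stitched back together. A minimal sketch (the single-space separator is just an assumption about how the output should read):

# Join the surviving text nodes, skipping fragments that are pure whitespace.
article_text = ' '.join(t.strip() for t in text_elements if t.strip())
print(article_text)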

Comments:

Answer 3:

Loop over every <p> tag and call get_text() on each one (getdata was not defined in the original snippet, so a simple requests-based fetch is assumed here):

import requests
from bs4 import BeautifulSoup

# Minimal fetch helper assumed in place of the undefined getdata from the original snippet
def getdata(url):
    return requests.get(url).text

htmldata = getdata("https://www.geeksforgeeks.org/how-to-automate-an-excel-sheet-in-python/?ref=feed")
soup = BeautifulSoup(htmldata, 'html.parser')
for data in soup.find_all("p"):
    print(data.get_text())

Comments:
