Extract text between two different tags with Beautiful Soup

【Title】Extract text between two different tags with Beautiful Soup 【Posted】2018-12-09 18:34:58 【Question】:

I am trying to extract the text content of an article from this web page.

I only want to extract the article content, not the "About the author" section.

The problem is that the content is not wrapped in a dedicated tag like a <div> that I could target; it is all in <p> tags, and when I extract every <p> tag I also get the "About the author" section. I have to scrape a lot of pages from this website. Is there a way to do this with Beautiful Soup?

Here is what I am currently trying:

p_tags = soup.find_all('p')
for row in p_tags:
    print(row)

【Comments】:

【Answer 1】:

All the paragraphs you want are inside the <div class="td-post-content"> tag, but so are the author-info paragraphs. However, the <p> tags you need are direct children of this <div>, while the other, unwanted <p> tags are not direct children (they are nested inside other div tags).

Therefore, you can use recursive=False to get only those direct children.

Code:

import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent header so the site does not block the request
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

r = requests.get('https://www.the-blockchain.com/2018/06/29/mcafee-labs-report-6x-increase-in-crypto-mining-malware-incidents-in-q1-2018/', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

# The article body is in <div class="td-post-content">; recursive=False keeps only
# its direct <p> children and skips the nested "About the author" paragraphs
container = soup.find('div', class_='td-post-content')
for para in container.find_all('p', recursive=False):
    print(para.text)

Output:

Cybersecurity giant McAfee released its McAfee Labs Threat Report: June 2018 on Wednesday, outlining the growth and trends of new malware and cyber threats in Q1 2018. According to the report, coin mining malware saw a 623 percent growth in the first quarter of 2018, infecting 2.9 million machines in that period. McAfee Labs counted 313 publicly disclosed security incidents in the first three months of 2018, a 41 percent increase over the previous quarter. In particular, incidents in the healthcare sector rose 57 percent, with a significant portion involving Bitcoin-based ransomware that healthcare institutions were often compelled to pay.
Chief Scientist at McAfee Raj Samani said, “There were new revelations this quarter concerning complex nation-state cyber-attack campaigns targeting users and enterprise systems worldwide. Bad actors demonstrated a remarkable level of technical agility and innovation in tools and tactics. Criminals continued to adopt cryptocurrency mining to easily monetize their criminal activity.”
Sizeable criminal organizations are responsible for many of the attacks in recent months. In January, malware dubbed Golden Dragon attacked organizations putting together the Pyeongchang Winter Olympics in South Korea, using a malicious word attachment to install a script that would encrypt and send stolen data to an attacker’s command center. The Lazarus cybercrime ring launched a highly sophisticated Bitcoin phishing campaign called HaoBao that targeted global financial organizations, sending an email attachment that would scan for Bitcoin activity, credentials and mining data.
Chief Technology Officer at McAfee Steve Grobman said, “Cybercriminals will gravitate to criminal activity that maximizes their profit. In recent quarters we have seen a shift to ransomware from data-theft,  as ransomware is a more efficient crime. With the rise in value of cryptocurrencies, the market forces are driving criminals to crypto-jacking and the theft of cryptocurrency. Cybercrime is a business, and market forces will continue to shape where adversaries focus their efforts.”
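
Since the question mentions scraping many pages from the same site, here is a minimal sketch of how the same recursive=False extraction could be wrapped into a reusable function; get_article_text and url_list are illustrative names, and the URL list is only a placeholder.

import requests
from bs4 import BeautifulSoup

HEADERS = {'User-Agent': 'Mozilla/5.0'}  # any browser-like User-Agent string

def get_article_text(url):
    # Fetch one article page and keep only the direct <p> children of the post body
    r = requests.get(url, headers=HEADERS)
    soup = BeautifulSoup(r.text, 'lxml')
    container = soup.find('div', class_='td-post-content')
    if container is None:  # layout may differ on some pages; skip them gracefully
        return ''
    return '\n'.join(p.text for p in container.find_all('p', recursive=False))

# Placeholder list; fill in the article URLs you actually need to scrape
url_list = [
    'https://www.the-blockchain.com/2018/06/29/mcafee-labs-report-6x-increase-in-crypto-mining-malware-incidents-in-q1-2018/',
]
for url in url_list:
    print(get_article_text(url))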

【Comments】:

【Answer 2】:

You need to use selenium for this; I tried doing it with requests, but it did not work because the data is loaded with javascript. Then parse the page source with bs4 as follows:

import bs4
from selenium import webdriver

# Path to chromedriver; adjust for your system
driver = webdriver.Chrome('/usr/local/bin/chromedriver')
website = "https://www.the-blockchain.com/2018/06/29/mcafee-labs-report-6x-increase-in-crypto-mining-malware-incidents-in-q1-2018/"
driver.get(website)
html = driver.page_source
driver.quit()
soup = bs4.BeautifulSoup(html, "html.parser")

# Select the list items in the author box's latest-posts list
elements = soup.select('#wpautbox_latest-post > ul > li')
for elem in elements:
    print(elem.text)

Output:

McAfee Labs Report 6x Increase in Crypto Mining Malware Incidents in Q1 2018 - June 29, 2018
Facebook Updates Policy To Allow Vetted Crypto Businesses to Advertise, ICOs Still Banned - June 27, 2018
Following in Vitalik’s Footsteps? Polkadot’s Habermeier Awarded Thiel Fellowship - June 26, 2018
And many other article titles

【Comments】:

Actually you don't need selenium here. You just need to pass some headers.
Maybe he used selenium, I don't know, but your answer is fine too.
This outputs a list of all the related articles, which is not what I wanted. By the way, I got the answer. Thanks anyway :)

【Answer 3】:

If you want to drop the "About the author" section along with everything else outside the article paragraphs, you can do so by printing the content of the span tags under the p tags within the td-post-content class. For brevity I used CSS selectors in this case. Give the approach below a try as well.

import requests
from bs4 import BeautifulSoup

url = 'https://www.the-blockchain.com/2018/06/29/mcafee-labs-report-6x-increase-in-crypto-mining-malware-incidents-in-q1-2018/'

# Any custom User-Agent string is enough for this site
res = requests.get(url, headers={"User-Agent": "defined"})
soup = BeautifulSoup(res.text, 'lxml')
# Grab only the <span> tags nested in <p> tags inside the article container
paragraph = [p.text for p in soup.select('.td-post-content p span')]
print(paragraph)
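
As a small usage note, the list comprehension above returns a list of strings, so the paragraphs can be joined into a single article body (article_text is an illustrative name):

# Join the extracted span texts into one block of article text
article_text = '\n'.join(paragraph)
print(article_text)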

【Comments】:
