Extract link in Beautiful Soup [duplicate]


Title: Extract link in beautiful soup [duplicate]
Posted: 2015-08-01 04:04:46

Question:

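I'm new to Beautiful Soup and I'm trying to figure out how to extract a website from a nested array. The website appears twice under the "track-visit-website" class.

This is not a duplicate of the question about how to extract an href. I have already done that successfully on this page. I am trying to isolate the actual company website.

I have tried several pieces of code but could not get any of them to work. Here is one example: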

print(item.contents[2].find_all("a", {"class": "track-visit-website"})[0].a)

The site is YP.com Septic Search.

Here is the markup for one of the items on the site:

<div class="info">
<h3 class="n">
<div class="info-section info-primary">
<p class="adr" itemprop="address" itemtype="http://schema.org/PostalAddress" itemscope="">
<span class="street-address" itemprop="streetAddress">2806 Farview Dr</span>
<span class="locality" itemprop="addressLocality">Fort Collins, </span>
<span itemprop="addressRegion">CO</span>
<span itemprop="postalCode">80524</span>
</p>
<div class="phones phone primary" itemprop="telephone">(970) 829-0852</div>
</div>
<div class="info-section info-secondary">
<div class="categories">
<div class="links">
<a class="track-visit-website" data-analytics="{"click_id":6,"act":2,"dku":"http://www.affordablesepticanddraincleaning.com","FL":"url","TL":"off","target":"website","LOC":"http://www.affordablesepticanddraincleaning.com"}" target="_blank" rel="nofollow" href="http://www.affordablesepticanddraincleaning.com" data-impressed="1">Website</a>
<a class="track-map-it directions" data-analytics="{"click_id":13,"target":"website","act":4}" href="/listings/1000775636908/directions" data-impressed="1">Directions</a>
<a class="track-more-info" data-analytics="{"click_id":7,"target":"moreInfo","act":1,"FL":"list"}" href="/fort-collins-co/mip/affordable-septic-drain-cleaning-llc-505109997?lid=1000775636908" data-impressed="1">More Info</a>
</div>

Comments:

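This is not a duplicate of the question mentioned. If I just look for "a" or "href", it throws several errors. This question is about handling nested arrays.

Answer 1: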

Copy this code snippet into a Python file and run it:

import re

content = """
<div class="info">
<h3 class="n">
<div class="info-section info-primary">
<p class="adr" itemprop="address" itemtype="http://schema.org/PostalAddress" itemscope="">
<span class="street-address" itemprop="streetAddress">2806 Farview Dr</span>
<span class="locality" itemprop="addressLocality">Fort Collins, </span>
<span itemprop="addressRegion">CO</span>
<span itemprop="postalCode">80524</span>
</p>
<div class="phones phone primary" itemprop="telephone">(970) 829-0852</div>
</div>
<div class="info-section info-secondary">
<div class="categories">
<div class="links">
<a class="track-visit-website" data-analytics="{"click_id":6,"act":2,"dku":"http://www.affordablesepticanddraincleaning.com","FL":"url","TL":"off","target":"website","LOC":"http://www.affordablesepticanddraincleaning.com"}" target="_blank" rel="nofollow" href="http://www.affordablesepticanddraincleaning.com" data-impressed="1">Website</a>
<a class="track-map-it directions" data-analytics="{"click_id":13,"target":"website","act":4}" href="/listings/1000775636908/directions" data-impressed="1">Directions</a>
<a class="track-more-info" data-analytics="{"click_id":7,"target":"moreInfo","act":1,"FL":"list"}" href="/fort-collins-co/mip/affordable-septic-drain-cleaning-llc-505109997?lid=1000775636908" data-impressed="1">More Info</a>
</div>
"""

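# Find every http:// URL in the content; {2,} requires a top-level domain of at least two letters
websites = set(re.findall(r'http://[a-zA-Z0-9\.]*\.[a-z]{2,}', content))
websites = list(websites)  # the set above removes duplicate URLs

print(websites)  # or in Python 2 => print websites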

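Now find a way to work this into your own code: fetch the HTML, store it in content, run the regular expression over it, and save the results to a file.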
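The steps just described could be sketched roughly as follows. This is only an illustration: the function names `extract_websites` and `save_websites`, the `sample` string, and the output file name `websites.txt` are assumptions made for the example, and fetching the page is left out, so `sample` stands in for HTML you have already downloaded.

```python
import os
import re
import tempfile

# Same pattern as in the answer: scheme, host characters, then a dot and a TLD
URL_PATTERN = re.compile(r'http://[a-zA-Z0-9.]*\.[a-z]{2,}')


def extract_websites(content):
    """Return the unique http:// URLs found in content, sorted for stable output."""
    return sorted(set(URL_PATTERN.findall(content)))


def save_websites(urls, path):
    """Write one URL per line to the file at path."""
    with open(path, "w", encoding="utf-8") as handle:
        for url in urls:
            handle.write(url + "\n")


# Stand-in for HTML that was fetched separately (hypothetical sample)
sample = (
    '<a class="track-visit-website" '
    'href="http://www.affordablesepticanddraincleaning.com">Website</a>'
)

urls = extract_websites(sample)

# Hypothetical output location; any writable path would do
out_path = os.path.join(tempfile.gettempdir(), "websites.txt")
save_websites(urls, out_path)

print(urls)
```

Sorting the set is a small design choice so that repeated runs write the file in the same order.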

For web scraping you need to know regular expressions.

Read up on them; here is a good tutorial: regex tutorial
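For readers new to regular expressions, the pattern from the answer can be written out piece by piece with `re.VERBOSE`. This is a sketch for illustration; the `text` string is a made-up sample modeled on the attributes in the question, not the real page.

```python
import re

# The answer's pattern, split into commented parts.
# In VERBOSE mode, whitespace and "#" comments inside the pattern are ignored.
pattern = re.compile(r"""
    http://          # literal scheme
    [a-zA-Z0-9.]*    # host name: letters, digits, and dots
    \.               # the dot before the top-level domain
    [a-z]{2,}        # top-level domain: two or more lowercase letters
""", re.VERBOSE)

# Hypothetical sample text with two URLs
text = (
    'href="http://www.example.com/page" '
    'itemtype="http://schema.org/PostalAddress"'
)

matches = pattern.findall(text)
print(matches)
```

Two limitations are worth noticing: the match stops at the host name, so any path after it is dropped, and every `http://` URL in the markup is picked up, including the `schema.org` one that is not a company website.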

Comments:

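Thanks for the link. I read through it, but it will take a few more passes before it sinks in. I tried the code, but it returned nothing. I tried content = item.contents[1].find_all("a", {"class": "track-visit-website"})[0] and got it to return that section. Is there a way to select the "LOC" attribute directly?

The regex code I posted has nothing to do with Beautiful Soup. Just take the HTML for one of the items on the site and save it as html. I have updated my answer.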
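On the question of selecting "LOC" directly, a minimal Beautiful Soup sketch might look like the one below. It rests on two assumptions: that the `bs4` package is installed, and that `data-analytics` holds a JSON object on the live page, which is a guess based on the pasted markup. The `html` string is a trimmed, well-formed stand-in for one listing. Note that `find_all(...)[0]` already returns the `<a>` tag itself, so appending `.a` to it looks for a nested `<a>` and yields `None`; the link is in the tag's `href` attribute.

```python
import json

from bs4 import BeautifulSoup

# Trimmed stand-in for one listing; the attribute is single-quoted here
# so the JSON inside it parses cleanly (an assumption about the real page)
html = """
<div class="links">
<a class="track-visit-website" data-analytics='{"click_id":6,"LOC":"http://www.affordablesepticanddraincleaning.com"}' href="http://www.affordablesepticanddraincleaning.com">Website</a>
<a class="track-more-info" href="/fort-collins-co/mip/affordable-septic-drain-cleaning-llc-505109997?lid=1000775636908">More Info</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Every <a> whose class includes "track-visit-website"
links = soup.find_all("a", class_="track-visit-website")

# The company website is the href attribute of each matching tag
hrefs = [a["href"] for a in links if a.has_attr("href")]

# Optionally read "LOC" out of the data-analytics attribute
locs = []
for a in links:
    try:
        data = json.loads(a.get("data-analytics", ""))
    except ValueError:
        continue  # attribute missing or not valid JSON
    if isinstance(data, dict) and "LOC" in data:
        locs.append(data["LOC"])

print(hrefs)
print(locs)
```

If the attribute turns out not to be valid JSON on the real page, the `href` values alone already give the company website.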
