Python BeautifulSoup: get the text of the first tag

Posted: 2016-03-26 14:17:22

[Question]:

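I need to use BeautifulSoup in Python to get the text of the a tag at the first level of each li tag.

The problem is that each li tag contains other li tags, which in turn contain other a tags.

Sample HTML: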

<li>
   <a href="http://lol.lol">Text1</a><-- GET THIS
   <li>
      <a href="http://lol.lol">Text1</a><-- DON'T GET THIS
   </li>
</li>
<li>
   <a href="http://lol.lol">Text2</a><-- GET THIS
   <li>
      <a href="http://lol.lol">Text2-2</a><-- DON'T GET THIS
   </li>
</li>

Edit:

I have been testing, but I cannot extract only the first a tag of each item.

This is the original snippet I am trying to extract from:

<div id="categories_block_left" class="block block-highlighted">
<h4 class="title_block">
<span class="icon-box fa fa-bars"></span>
RELOJES
</h4>
<div class="block_content" style="">
<ul class="list-block list-group bullet tree dynamized" style="display: block;">
<li>
<span class="grower CLOSE"> </span><a href="http://www.joyeriasanchez.com/50-outlet" title="OUTLET">
OUTLET
<span id="leo-cat-50" style="display:none" class="leo-qty badge pull-right"></span>
</a>
<ul style="display: none;">
<li>
<a href="http://www.joyeriasanchez.com/47-adidas" title="Adidas">
Adidas
<span id="leo-cat-47" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/125-miss-sixty" title="Miss Sixty">
Miss Sixty
<span id="leo-cat-125" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/49-converse" title="Converse">
Converse
<span id="leo-cat-49" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/61-armand-basi" title="Armand Basi">
Armand Basi
<span id="leo-cat-61" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/79-marea" title="Marea">
Marea
<span id="leo-cat-79" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/86-marc-ecko" title="Marc Ecko">
Marc Ecko
<span id="leo-cat-86" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/107-festina" title="Festina">
Festina
<span id="leo-cat-107" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/135-seiko" title="Seiko">
Seiko
<span id="leo-cat-135" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/221-relojes-swatch-liquidar" title="Relojes Swatch liquidar">
Relojes Swatch liquidar
<span id="leo-cat-221" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</li>
<li>
<span class="grower CLOSE"> </span><a href="http://www.joyeriasanchez.com/184-lotus" title="Lotus">
Lotus
<span id="leo-cat-184" style="display:none" class="leo-qty badge pull-right"></span>
</a>
<ul style="display: none;">
<li>
<a href="http://www.joyeriasanchez.com/195-lotus-hombre" title="Lotus Hombre">
Lotus Hombre
<span id="leo-cat-195" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/196-lotus-mujer" title="Lotus Mujer">
Lotus Mujer
<span id="leo-cat-196" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/236-lotus-infantil" title="Lotus Infantil">
Lotus Infantil
<span id="leo-cat-236" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</li>
<li>
<a href="http://www.joyeriasanchez.com/218-daniel-wellington" title="Daniel Wellington">
Daniel Wellington
<span id="leo-cat-218" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<span class="grower CLOSE"> </span><a href="http://www.joyeriasanchez.com/197-viceroy" title="Viceroy">
Viceroy
<span id="leo-cat-197" style="display:none" class="leo-qty badge pull-right"></span>
</a>
<ul style="display: none;">
<li>
<a href="http://www.joyeriasanchez.com/198-viceroy-hombre" title="Viceroy Hombre">
Viceroy Hombre
<span id="leo-cat-198" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/199-viceroy-mujer" title="Viceroy Mujer">
Viceroy Mujer
<span id="leo-cat-199" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/235-viceroy-infantil" title="Viceroy Infantil">
Viceroy Infantil
<span id="leo-cat-235" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</li>
<li>
<a href="http://www.joyeriasanchez.com/51-ice-watch" title="Ice watch">
Ice watch
<span id="leo-cat-51" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/64-relojes-swatch" title="Relojes Swatch">
Relojes Swatch
<span id="leo-cat-64" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/80-mark-maddox" title="Mark Maddox">
Mark Maddox
<span id="leo-cat-80" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/81-ferrari" title="Ferrari">
Ferrari
<span id="leo-cat-81" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/173-relojes-cadete" title="Relojes Cadete">
Relojes Cadete
<span id="leo-cat-173" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<span class="grower CLOSE"> </span><a href="http://www.joyeriasanchez.com/200-tous" title="Tous">
Tous
<span id="leo-cat-200" style="display:none" class="leo-qty badge pull-right"></span>
</a>
<ul style="display: none;">
<li>
<a href="http://www.joyeriasanchez.com/201-tous-kids" title="Tous Kids">
Tous Kids
<span id="leo-cat-201" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/203-tous-mujer" title="Tous Mujer">
Tous Mujer
<span id="leo-cat-203" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/204-tous-hombre" title="Tous Hombre">
Tous Hombre
<span id="leo-cat-204" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/220-certina" title="Certina">
Certina
<span id="leo-cat-220" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</div>
</div>

This is the code I am using to extract it:

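import requests
from bs4 import BeautifulSoup

# url2 is defined elsewhere in the script (not shown)
req2 = requests.get(url2)
html2 = BeautifulSoup(req2.text, 'html.parser')
# the attribute filter must be passed as a dict
catmenu = html2.find('div', {'id': 'categories_block_left'})
categorys = catmenu.find_all('li', recursive=False)
for cat in categorys:
    categor = cat.find('a').getText()
    print("   SubCategor:%s" % categor)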

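But nothing is returned. I only need the first a tag of each top-level li. Expected result:

OUTLET, Lotus, Daniel Wellington, Viceroy, Ice watch, Relojes Swatch, Mark Maddox, Ferrari, Relojes Cadete, Tous, Certina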


[Answer 1]:

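You can pass recursive=False to the find_all method. It then searches only the direct children of the object it is called on, so here it returns only the top-level li tags: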

In [62]: soup.find_all('li', recursive=False)
Out[62]: 
[<li>
 <a href="http://lol.lol">Text1</a>
 <li>
 <a href="http://lol.lol">Text1</a>
 </li>
 </li>, <li>
 <a href="http://lol.lol">Text2</a>
 <li>
 <a href="http://lol.lol">Text2-2</a>
 </li></li>]
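The session above does not show how soup was created. Here is a minimal sketch, assuming the sample HTML from the question is parsed with Python's built-in html.parser. That parser choice is an assumption, not something the answer states. It matters because parsers such as lxml wrap a fragment in html and body tags, so the li tags would no longer be direct children of soup.

```python
from bs4 import BeautifulSoup

# Sample markup from the question (annotations removed).
SAMPLE = """
<li>
   <a href="http://lol.lol">Text1</a>
   <li>
      <a href="http://lol.lol">Text1</a>
   </li>
</li>
<li>
   <a href="http://lol.lol">Text2</a>
   <li>
      <a href="http://lol.lol">Text2-2</a>
   </li>
</li>
"""

# Assumption: html.parser keeps the fragment as-is, without adding <html>/<body>.
soup = BeautifulSoup(SAMPLE, "html.parser")

# Only direct children of soup are inspected.
top_level = soup.find_all("li", recursive=False)

# Default behaviour: every descendant is inspected, nested items included.
all_items = soup.find_all("li")

print(len(top_level), len(all_items))
```

Comparing the two counts is a quick way to check that recursive=False is really limiting the search to the outer items.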

Then you can retrieve the text from the first a tag of each li:

In [63]: [li.find('a').text for li in soup.find_all('li', recursive=False)]
Out[63]: ['Text1', 'Text2']
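In the real snippet from the question, the li tags are not direct children of the div with id categories_block_left. They sit inside a ul nested one level deeper, which is why calling find_all('li', recursive=False) on the div finds nothing. The sketch below applies the same idea to a shortened copy of that markup. The trimmed HTML, the html.parser choice and the variable names are illustrative assumptions, and starting from the first ul is only one possible approach.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup: two top-level items, one with a submenu.
HTML = """
<div id="categories_block_left" class="block block-highlighted">
<h4 class="title_block">RELOJES</h4>
<div class="block_content">
<ul class="list-block list-group bullet tree dynamized">
<li>
<span class="grower CLOSE"> </span><a href="http://www.joyeriasanchez.com/50-outlet" title="OUTLET">
OUTLET
<span id="leo-cat-50" style="display:none" class="leo-qty badge pull-right"></span>
</a>
<ul style="display: none;">
<li>
<a href="http://www.joyeriasanchez.com/47-adidas" title="Adidas">
Adidas
</a>
</li>
</ul>
</li>
<li>
<a href="http://www.joyeriasanchez.com/218-daniel-wellington" title="Daniel Wellington">
Daniel Wellington
</a>
</li>
</ul>
</div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
menu = soup.find("div", id="categories_block_left")

# The div's direct children are the h4 and the inner div, so this list is empty.
direct = menu.find_all("li", recursive=False)

# find() returns the first match in document order, which is the outer <ul>.
top_ul = menu.find("ul")

# Direct <li> children of the outer list; the first <a> in each is the category link.
names = [
    li.find("a").get_text(strip=True)
    for li in top_ul.find_all("li", recursive=False)
]
print(names)
```

get_text(strip=True) is used because the link text in this markup is surrounded by newlines.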
