Python BeautifulSoup get text first tag
Posted: 2016-03-26 14:17:22
Question: I need to use BeautifulSoup in Python to get the text of the tag at the first level of each li tag.
The problem is that each li tag contains other li tags, which in turn contain other tags.
Example HTML:
<li>
<a href="http://lol.lol">Text1</a><-- GET THIS
<li>
<a href="http://lol.lol">Text1</a><-- DON'T GET THIS
</li>
</li>
<li>
<a href="http://lol.lol">Text2</a><-- GET THIS
<li>
<a href="http://lol.lol">Text2-2</a><-- DON'T GET THIS
</li>
</li>
Edit:
I have been testing, but I still cannot extract only the first a tag.
This is the original snippet I am trying to extract from:
<div id="categories_block_left" class="block block-highlighted">
<h4 class="title_block">
<span class="icon-box fa fa-bars"></span>
RELOJES
</h4>
<div class="block_content" style="">
<ul class="list-block list-group bullet tree dynamized" style="display: block;">
<li>
<span class="grower CLOSE"> </span><a href="http://www.joyeriasanchez.com/50-outlet" title="OUTLET">
OUTLET
<span id="leo-cat-50" style="display:none" class="leo-qty badge pull-right"></span>
</a>
<ul style="display: none;">
<li>
<a href="http://www.joyeriasanchez.com/47-adidas" title="Adidas">
Adidas
<span id="leo-cat-47" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/125-miss-sixty" title="Miss Sixty">
Miss Sixty
<span id="leo-cat-125" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/49-converse" title="Converse">
Converse
<span id="leo-cat-49" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/61-armand-basi" title="Armand Basi">
Armand Basi
<span id="leo-cat-61" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/79-marea" title="Marea">
Marea
<span id="leo-cat-79" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/86-marc-ecko" title="Marc Ecko">
Marc Ecko
<span id="leo-cat-86" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/107-festina" title="Festina">
Festina
<span id="leo-cat-107" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/135-seiko" title="Seiko">
Seiko
<span id="leo-cat-135" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/221-relojes-swatch-liquidar" title="Relojes Swatch liquidar">
Relojes Swatch liquidar
<span id="leo-cat-221" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</li>
<li>
<span class="grower CLOSE"> </span><a href="http://www.joyeriasanchez.com/184-lotus" title="Lotus">
Lotus
<span id="leo-cat-184" style="display:none" class="leo-qty badge pull-right"></span>
</a>
<ul style="display: none;">
<li>
<a href="http://www.joyeriasanchez.com/195-lotus-hombre" title="Lotus Hombre">
Lotus Hombre
<span id="leo-cat-195" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/196-lotus-mujer" title="Lotus Mujer">
Lotus Mujer
<span id="leo-cat-196" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/236-lotus-infantil" title="Lotus Infantil">
Lotus Infantil
<span id="leo-cat-236" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</li>
<li>
<a href="http://www.joyeriasanchez.com/218-daniel-wellington" title="Daniel Wellington">
Daniel Wellington
<span id="leo-cat-218" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<span class="grower CLOSE"> </span><a href="http://www.joyeriasanchez.com/197-viceroy" title="Viceroy">
Viceroy
<span id="leo-cat-197" style="display:none" class="leo-qty badge pull-right"></span>
</a>
<ul style="display: none;">
<li>
<a href="http://www.joyeriasanchez.com/198-viceroy-hombre" title="Viceroy Hombre">
Viceroy Hombre
<span id="leo-cat-198" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/199-viceroy-mujer" title="Viceroy Mujer">
Viceroy Mujer
<span id="leo-cat-199" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/235-viceroy-infantil" title="Viceroy Infantil">
Viceroy Infantil
<span id="leo-cat-235" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</li>
<li>
<a href="http://www.joyeriasanchez.com/51-ice-watch" title="Ice watch">
Ice watch
<span id="leo-cat-51" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/64-relojes-swatch" title="Relojes Swatch">
Relojes Swatch
<span id="leo-cat-64" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/80-mark-maddox" title="Mark Maddox">
Mark Maddox
<span id="leo-cat-80" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/81-ferrari" title="Ferrari">
Ferrari
<span id="leo-cat-81" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/173-relojes-cadete" title="Relojes Cadete">
Relojes Cadete
<span id="leo-cat-173" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<span class="grower CLOSE"> </span><a href="http://www.joyeriasanchez.com/200-tous" title="Tous">
Tous
<span id="leo-cat-200" style="display:none" class="leo-qty badge pull-right"></span>
</a>
<ul style="display: none;">
<li>
<a href="http://www.joyeriasanchez.com/201-tous-kids" title="Tous Kids">
Tous Kids
<span id="leo-cat-201" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/203-tous-mujer" title="Tous Mujer">
Tous Mujer
<span id="leo-cat-203" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/204-tous-hombre" title="Tous Hombre">
Tous Hombre
<span id="leo-cat-204" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/220-certina" title="Certina">
Certina
<span id="leo-cat-220" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</div>
</div>
This is the code I tried:
import requests
from bs4 import BeautifulSoup

req2 = requests.get(url2)
html2 = BeautifulSoup(req2.text)
catmenu = html2.find('div', {'id': 'categories_block_left'})
categorys = catmenu.find_all('li', recursive=False)
for cat in categorys:
    categor = cat.find('a').getText()
    print(" SubCategor:%s" % categor)
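But it returns nothing. I only need the first a tag of each top-level li.
Expected result:
OUTLET, Lotus, Daniel Wellington, Viceroy, Ice watch, Relojes Swatch, Mark Maddox, Ferrari, Relojes Cadete, Tous, Certina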
Answer 1: You can pass recursive=False to the find_all method, which returns only the top-level li tags:
In [62]: soup.find_all('li', recursive=False)
Out[62]:
[<li>
<a href="http://lol.lol">Text1</a>
<li>
<a href="http://lol.lol">Text1</a>
</li>
</li>, <li>
<a href="http://lol.lol">Text2</a>
<li>
<a href="http://lol.lol">Text2-2</a>
</li></li>]
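To make the effect of recursive=False concrete, here is a minimal sketch that compares the default search with the non-recursive one on the sample markup from the question. It assumes the built-in html.parser backend, which keeps the nested li tags as written; other parsers (lxml, html5lib) may wrap or restructure the markup, in which case the top-level search on soup could behave differently.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, with the arrow annotations removed.
SAMPLE_HTML = """
<li>
<a href="http://lol.lol">Text1</a>
<li>
<a href="http://lol.lol">Text1</a>
</li>
</li>
<li>
<a href="http://lol.lol">Text2</a>
<li>
<a href="http://lol.lol">Text2-2</a>
</li>
</li>
"""

# Assumption: html.parser, so the nested <li> tags stay nested.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Default search walks all descendants, so nested <li> tags are included.
all_items = soup.find_all("li")

# recursive=False only looks at direct children of the object it is called on.
top_items = soup.find_all("li", recursive=False)

print(len(all_items), len(top_items))
```

The key point is that recursive=False is relative to the object find_all is called on: it limits the search to that object's direct children.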
Then you can retrieve the text from the first a tag of each li:
In [63]: [li.find('a').text for li in soup.find_all('li', recursive=False)]
Out[63]: ['Text1', 'Text2']
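A likely reason the code in the question prints nothing is that, in the category markup from the edit, the li tags are not direct children of the categories_block_left div; they sit inside div > ul. The sketch below applies the same recursive=False idea one level further down. The HTML here is an abbreviated stand-in for the question's snippet (two top-level entries instead of eleven), and it assumes the live page has the same nesting.

```python
from bs4 import BeautifulSoup

# Abbreviated stand-in for the category block in the question.
# The nesting matches the original: div > div > ul > li.
HTML = """
<div id="categories_block_left" class="block block-highlighted">
<h4 class="title_block">RELOJES</h4>
<div class="block_content">
<ul class="list-block list-group bullet tree dynamized">
<li>
<span class="grower CLOSE"> </span><a href="http://www.joyeriasanchez.com/184-lotus" title="Lotus">
Lotus
<span id="leo-cat-184" style="display:none" class="leo-qty badge pull-right"></span>
</a>
<ul style="display: none;">
<li>
<a href="http://www.joyeriasanchez.com/195-lotus-hombre" title="Lotus Hombre">
Lotus Hombre
</a>
</li>
</ul>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/220-certina" title="Certina">
Certina
</a>
</li>
</ul>
</div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
menu = soup.find("div", id="categories_block_left")

# The <li> tags are not direct children of the div, so this list is empty.
direct_from_div = menu.find_all("li", recursive=False)

# Step down to the outer <ul> first; its direct <li> children are the top level.
outer_list = menu.find("ul")
top_items = outer_list.find_all("li", recursive=False)

# find("a") returns the first <a> in document order, which is the category link.
names = [li.find("a").get_text(strip=True) for li in top_items]
print(names)
```

get_text(strip=True) is used instead of .text because the link text in this markup is surrounded by newlines and followed by an empty span.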