Convert HTML (unordered) list to nested Python dictionary
[Posted]: 2019-10-19 08:49:17
[Question]: If I have a nested HTML (unordered) list that looks like this:
<ul style="">
<li class="jstree-last jstree-open" id="wfo-7000000004">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-7000000004">
<ins class="jstree-icon"> </ins>
Acoraceae
</a>
<ul style="">
<li class="jstree-last jstree-open" id="wfo-4000000350">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-4000000350">
<ins class="jstree-icon"> </ins>
Acorus
</a>
<ul style="">
<li class="jstree-open" id="wfo-0000350733">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-0000350733">
<ins class="jstree-icon"> </ins>
Acorus calamus
</a>
<ul style="">
<li class="jstree-leaf" id="wfo-0000350841">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-0000350841">
<ins class="jstree-icon"> </ins>
Acorus calamus var. americanus
</a>
</li>
<li class="jstree-last jstree-leaf" id="wfo-0000350949">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-0000350949">
<ins class="jstree-icon"> </ins>
Acorus calamus var. angustatus
</a>
</li>
</ul>
</li>
<li class="jstree-last jstree-leaf" id="wfo-0000352676">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-0000352676">
<ins class="jstree-icon"> </ins>
Acorus gramineus
</a>
</li>
</ul>
</li>
</ul>
</li>
</ul>
How can I build a nested dictionary from it in Python? For example:
Acorales:
    Acoraceae:
        Acorus:
            Acoruscalamus: [
                Acoruscalamusvar.americanus,
                Acoruscalamusvar.angustatus
            ],
            Acorusgramineus
I believe libraries such as Beautiful Soup and HTML Parser can do this (with for loops in Python), but I can't figure it out. Thanks for your help!
I tried this:
def create_dic(soup):
    return {
        li.a.get_text().replace("\xa0", ""): create_dic(li)
        for ul in soup('ul', recursive=False)
        for li in ul('li', recursive=False)
    }
However, the output looks like this (Acorus calamus var. americanus and Acorus calamus var. angustatus should be in a list, and Acorus gramineus should not be a dictionary):
{'Acorales': {'Acoraceae': {'Acorus': {'Acorus calamus': {'Acorus calamus var. americanus': {},
                                                          'Acorus calamus var. angustatus': {}},
                                       'Acorus gramineus': {}}}}}
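[Comments]:
Possible duplicate of "Python Beautiful Soup how to JSON decode to `dict`?"
Does this answer your question? "Parsing nested HTML list with BeautifulSoup"
[Answer 1]: I'm answering this because, for the answer in "Parsing nested HTML list with BeautifulSoup" to work, you first have to call BeautifulSoup to parse the uls out of your HTML. I have also flagged the question as a duplicate, so if it is one, just close or delete it.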
from bs4 import BeautifulSoup
htmlbody = '''
<ul style="">
<li class="jstree-last jstree-open" id="wfo-7000000004">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-7000000004">
<ins class="jstree-icon"> </ins>
Acoraceae
</a>
<ul style="">
<li class="jstree-last jstree-open" id="wfo-4000000350">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-4000000350">
<ins class="jstree-icon"> </ins>
Acorus
</a>
<ul style="">
<li class="jstree-open" id="wfo-0000350733">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-0000350733">
<ins class="jstree-icon"> </ins>
Acorus calamus
</a>
<ul style="">
<li class="jstree-leaf" id="wfo-0000350841">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-0000350841">
<ins class="jstree-icon"> </ins>
Acorus calamus var. americanus
</a>
</li>
<li class="jstree-last jstree-leaf" id="wfo-0000350949">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-0000350949">
<ins class="jstree-icon"> </ins>
Acorus calamus var. angustatus
</a>
</li>
</ul>
</li>
<li class="jstree-last jstree-leaf" id="wfo-0000352676">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-0000352676">
<ins class="jstree-icon"> </ins>
Acorus gramineus
</a>
</li>
</ul>
</li>
</ul>
</li>
</ul>
'''
def ul_to_dict(ul):
    result = {}
    # Only look at the direct <li> children of this <ul>.
    for li in ul.find_all("li", recursive=False):
        # The first non-empty string inside the <li> is the link text.
        key = next(li.stripped_strings)
        child_ul = li.find("ul")
        if child_ul:
            result[key] = ul_to_dict(child_ul)
        else:
            result[key] = None
    return result

# Let BeautifulSoup do its magic and parse the ul from the HTML.
htmlbody = BeautifulSoup(htmlbody, "html.parser").ul
# Run our function.
ul_to_dict(htmlbody)
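The function above stores `None` for every leaf, while the question asks for a list wherever all children are leaves. The sketch below is one possible way to get closer to that shape; it is not the original answer's method. The name `ul_to_tree`, the trimmed sample HTML, and the choice to keep `None` for a leaf that sits next to a sub-tree (such as Acorus gramineus) are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def ul_to_tree(ul):
    """Return a list if every <li> under this <ul> is a leaf, else a dict."""
    entries = []
    for li in ul.find_all("li", recursive=False):
        # split()/join collapses whitespace, including non-breaking spaces.
        name = " ".join(li.a.get_text().split())
        # Only a <ul> that is a direct child of this <li> counts as its subtree.
        child_ul = li.find("ul", recursive=False)
        entries.append((name, ul_to_tree(child_ul) if child_ul else None))
    if all(child is None for _, child in entries):
        return [name for name, _ in entries]
    # Mixed level: leaves keep None as their value (an assumption).
    return dict(entries)


# Hypothetical, shortened version of the HTML from the question.
html = """
<ul>
  <li><a href="#"><ins>&nbsp;</ins>Acorus</a>
    <ul>
      <li><a href="#"><ins>&nbsp;</ins>Acorus calamus</a>
        <ul>
          <li><a href="#">Acorus calamus var. americanus</a></li>
          <li><a href="#">Acorus calamus var. angustatus</a></li>
        </ul>
      </li>
      <li><a href="#">Acorus gramineus</a></li>
    </ul>
  </li>
</ul>
"""

tree = ul_to_tree(BeautifulSoup(html, "html.parser").ul)
print(tree)
```

A level that mixes leaves and sub-trees cannot be both a list and a dict, which is why the mixed case falls back to a dict here; adjust that rule if your data needs a different convention.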