Parsing XML in Python 3
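I have the following XML, which contains the email configuration for various email service providers, and I am trying to parse this information into a dict: hostname, is_ssl, port, protocol, etc.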
<domains>
<domain>
<name>zoznam.sk</name>
<description>Zoznam Slovakia</description>
<service>
<hostname>imap.zoznam.sk</hostname>
<port>143</port>
<protocol>IMAP</protocol>
<authentication>PLAIN</authentication>
<usernameIncludesDomain/>
</service>
<service>
<hostname>smtp.zoznam.sk</hostname>
<port>587</port>
<protocol>SMTP</protocol>
<authentication>PLAIN</authentication>
<usernameIncludesDomain/>
</service>
</domain>
<domain>
<name>123mail.org</name>
<description>123mail.org</description>
<service>
<hostname>imap.fastmail.com</hostname>
<port>993</port>
<protocol>IMAP</protocol>
<ssl/>
<requires/>
<authentication>PLAIN</authentication>
<usernameIncludesDomain/>
</service>
<service>
<hostname>smtp.fastmail.com</hostname>
<port>587</port>
<protocol>SMTP</protocol>
<ssl/>
<requires/>
<authentication>PLAIN</authentication>
<usernameIncludesDomain/>
</service>
</domain>
<domain>
<name>Netvigator.com</name>
<description>netvigator.com</description>
<service>
<hostname>corpmail1.netvigator.com</hostname>
<port>995</port>
<protocol>POP</protocol>
<ssl/>
<authentication>NONE</authentication>
<usernameIncludesDomain/>
</service>
<service>
<hostname>corpmail1.netvigator.com</hostname>
<port>587</port>
<protocol>SMTP</protocol>
<ssl/>
<authentication>NONE</authentication>
<usernameIncludesDomain/>
</service>
</domain>
</domains>
As a test I tried to parse just the names, but could not get it to work. I need a Python solution.
import xml.etree.ElementTree as ET

configs_file = 'isp_list.xml'

def parseXML(xmlfile):
    # create element tree object
    tree = ET.parse(xmlfile)
    # get root element
    root = tree.getroot()
    # create empty list for configs items
    configs = []
    # iterate items
    for item in root.findall('domains/domain'):
        value = item.get('name')
        # test
        print(value)
        # append news dictionary to items list
        configs.append(item)
    # return items list
    return configs
I would appreciate your help. Thanks.
Answer
You can still use bs4 to build a dictionary.
For the if/else lines, you can use a more compact syntax, for example:
'ssl' : getattr(item.find('ssl'), 'text', 'N/A')
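The compact form works because find returns None when the tag is missing, None has no text attribute, and getattr then falls back to its third argument. Here is a minimal sketch of that behaviour, using the standard library's xml.etree.ElementTree in place of bs4 so that it runs without extra packages; the helper name text_or_default is hypothetical and not part of either library.

```python
import xml.etree.ElementTree as ET

# A single <service> element with no <ssl> child.
service = ET.fromstring(
    "<service><hostname>imap.zoznam.sk</hostname><port>143</port></service>"
)

def text_or_default(parent, tag, default='N/A'):
    # find() returns None when the tag is absent; None has no .text,
    # so getattr() returns the default instead of raising AttributeError.
    return getattr(parent.find(tag), 'text', default)

hostname = text_or_default(service, 'hostname')  # element present
ssl = text_or_default(service, 'ssl')            # element absent

print(hostname, ssl)  # imap.zoznam.sk N/A
```

One difference to keep in mind: for an empty element such as <ssl/>, ElementTree reports text as None, while bs4 reports an empty string, so in neither library does this idiom return the default for a tag that is present but empty.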
Script:
from bs4 import BeautifulSoup as bs
xml = '''
<domains>
<domain>
<name>zoznam.sk</name>
<description>Zoznam Slovakia</description>
<service>
<hostname>imap.zoznam.sk</hostname>
<port>143</port>
<protocol>IMAP</protocol>
<authentication>PLAIN</authentication>
<usernameIncludesDomain/>
</service>
<service>
<hostname>smtp.zoznam.sk</hostname>
<port>587</port>
<protocol>SMTP</protocol>
<authentication>PLAIN</authentication>
<usernameIncludesDomain/>
</service>
</domain>
<domain>
<name>123mail.org</name>
<description>123mail.org</description>
<service>
<hostname>imap.fastmail.com</hostname>
<port>993</port>
<protocol>IMAP</protocol>
<ssl/>
<requires/>
<authentication>PLAIN</authentication>
<usernameIncludesDomain/>
</service>
<service>
<hostname>smtp.fastmail.com</hostname>
<port>587</port>
<protocol>SMTP</protocol>
<ssl/>
<requires/>
<authentication>PLAIN</authentication>
<usernameIncludesDomain/>
</service>
</domain>
<domain>
<name>Netvigator.com</name>
<description>netvigator.com</description>
<service>
<hostname>corpmail1.netvigator.com</hostname>
<port>995</port>
<protocol>POP</protocol>
<ssl/>
<authentication>NONE</authentication>
<usernameIncludesDomain/>
</service>
<service>
<hostname>corpmail1.netvigator.com</hostname>
<port>587</port>
<protocol>SMTP</protocol>
<ssl/>
<authentication>NONE</authentication>
<usernameIncludesDomain/>
</service>
</domain>
</domains>
'''
data = {}
# The 'lxml' HTML parser lowercases tag names, hence 'usernameincludesdomain' below.
soup = bs(xml, 'lxml')

for domain in soup.select('domain'):
    name = domain.select_one('name').text
    data[name] = {
        'name' : name,
        'desc' : domain.select_one('description').text,
        'services' : {}
    }
    # Number the services of each domain starting from 1.
    i = 1
    for item in domain.select('service'):
        # Each field falls back to 'N/A' when its tag is missing.
        service = {
            'hostname' : item.select_one('hostname').text if item.select_one('hostname') else 'N/A',
            'port' : item.select_one('port').text if item.select_one('port') else 'N/A',
            'protocol' : item.select_one('protocol').text if item.select_one('protocol') else 'N/A',
            'ssl' : item.select_one('ssl').text if item.select_one('ssl') else 'N/A',
            'requires' : item.select_one('requires').text if item.select_one('requires') else 'N/A',
            'authentication' : item.select_one('authentication').text if item.select_one('authentication') else 'N/A',
            'usernameincludesdomain' : item.select_one('usernameincludesdomain').text if item.select_one('usernameincludesdomain') else 'N/A'
        }
        data[name]['services'][str(i)] = service
        i += 1

print(data)
You can view the resulting structure here.
If what you really want is to convert the XML into a JSON-like structure, would a library like untangle work?
Another answer
If you only need the names, BeautifulSoup makes that easy:
from bs4 import BeautifulSoup
s='''<domains>
<domain>
<name>zoznam.sk</name>
<description>Zoznam Slovakia</description>
<service>
<hostname>imap.zoznam.sk</hostname>
<port>143</port>
<protocol>IMAP</protocol>
<authentication>PLAIN</authentication>
<usernameIncludesDomain/>
</service>
<service>
<hostname>smtp.zoznam.sk</hostname>
<port>587</port>
<protocol>SMTP</protocol>
<authentication>PLAIN</authentication>
<usernameIncludesDomain/>
</service>
</domain>
<domain>
<name>123mail.org</name>
<description>123mail.org</description>
<service>
<hostname>imap.fastmail.com</hostname>
<port>993</port>
<protocol>IMAP</protocol>
<ssl/>
<requires/>
<authentication>PLAIN</authentication>
<usernameIncludesDomain/>
</service>
<service>
<hostname>smtp.fastmail.com</hostname>
<port>587</port>
<protocol>SMTP</protocol>
<ssl/>
<requires/>
<authentication>PLAIN</authentication>
<usernameIncludesDomain/>
</service>
</domain>
<domain>
<name>Netvigator.com</name>
<description>netvigator.com</description>
<service>
<hostname>corpmail1.netvigator.com</hostname>
<port>995</port>
<protocol>POP</protocol>
<ssl/>
<authentication>NONE</authentication>
<usernameIncludesDomain/>
</service>
<service>
<hostname>corpmail1.netvigator.com</hostname>
<port>587</port>
<protocol>SMTP</protocol>
<ssl/>
<authentication>NONE</authentication>
<usernameIncludesDomain/>
</service>
</domain>
</domains>'''
soup = BeautifulSoup(s, 'html.parser')
configs = [n.text for n in soup.find_all('name')]
You get:
['zoznam.sk', '123mail.org', 'Netvigator.com']
To get the information for each service, you can add the following code:
soup = BeautifulSoup(s, 'html.parser')
configs = {}
services = soup.find_all('service')
for serv in services:
    hostname = serv.find('hostname').text
    configs[hostname] = {}
    configs[hostname]['port'] = serv.find('port').text
    configs[hostname]['protocol'] = serv.find('protocol').text
    configs[hostname]['auth'] = serv.find('authentication').text
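You get configs, which is a dictionary of dictionaries. Because it is keyed by hostname and corpmail1.netvigator.com appears twice in the XML, that host's SMTP entry overwrites its POP entry: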
{'imap.zoznam.sk': {'port': '143', 'protocol': 'IMAP', 'auth': 'PLAIN'},
'smtp.zoznam.sk': {'port': '587', 'protocol': 'SMTP', 'auth': 'PLAIN'},
'imap.fastmail.com': {'port': '993', 'protocol': 'IMAP', 'auth': 'PLAIN'},
'smtp.fastmail.com': {'port': '587', 'protocol': 'SMTP', 'auth': 'PLAIN'},
'corpmail1.netvigator.com': {'port': '587', 'protocol': 'SMTP', 'auth': 'NONE'}}
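The ElementTree attempt in the question can also be made to work without bs4. It fails for two reasons: getroot() already returns the domains element, so the path 'domains/domain' matches nothing, and item.get('name') reads an XML attribute, whereas name is a child element here. The following is a minimal sketch under some assumptions: the function name parse_configs, the shortened SAMPLE string and the choice to treat the presence of an ssl tag as is_ssl are illustrative and do not come from either answer. Services are collected in a list per domain, so two services sharing a hostname are both kept.

```python
import xml.etree.ElementTree as ET

# Shortened sample with the same layout as the XML in the question.
SAMPLE = """
<domains>
  <domain>
    <name>zoznam.sk</name>
    <description>Zoznam Slovakia</description>
    <service>
      <hostname>imap.zoznam.sk</hostname>
      <port>143</port>
      <protocol>IMAP</protocol>
      <authentication>PLAIN</authentication>
      <usernameIncludesDomain/>
    </service>
  </domain>
  <domain>
    <name>Netvigator.com</name>
    <description>netvigator.com</description>
    <service>
      <hostname>corpmail1.netvigator.com</hostname>
      <port>995</port>
      <protocol>POP</protocol>
      <ssl/>
      <authentication>NONE</authentication>
      <usernameIncludesDomain/>
    </service>
    <service>
      <hostname>corpmail1.netvigator.com</hostname>
      <port>587</port>
      <protocol>SMTP</protocol>
      <ssl/>
      <authentication>NONE</authentication>
      <usernameIncludesDomain/>
    </service>
  </domain>
</domains>
"""

def parse_configs(root):
    """Return one dict per <domain>, each holding a list of service dicts."""
    configs = []
    # root is <domains> itself, so the path is relative to it: 'domain'.
    for domain in root.findall('domain'):
        services = []
        for svc in domain.findall('service'):
            services.append({
                # findtext() reads a child element's text, with a fallback.
                'hostname': svc.findtext('hostname', default='N/A'),
                'port': svc.findtext('port', default='N/A'),
                'protocol': svc.findtext('protocol', default='N/A'),
                'authentication': svc.findtext('authentication', default='N/A'),
                # Assumption: an <ssl/> tag being present means SSL is used.
                'is_ssl': svc.find('ssl') is not None,
            })
        configs.append({
            'name': domain.findtext('name', default='N/A'),
            'description': domain.findtext('description', default='N/A'),
            'services': services,
        })
    return configs

# For a file on disk, ET.parse('isp_list.xml').getroot() gives the same root.
root = ET.fromstring(SAMPLE)
configs = parse_configs(root)
for config in configs:
    print(config['name'], [s['protocol'] for s in config['services']])
```

Port values are left as strings, as in the answers above; convert them with int() if numeric ports are needed.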