Python: how to decode only a specific part of XML using suds MessagePlugin and lxml

Posted 2023-02-24

[Published]: 2021-11-15 15:21:19 [Question]:

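I am fetching product information from an endpoint. To parse this information I use a filter, implemented as a suds MessagePlugin.

The incoming data looks like this (this is not the whole response, only a small part of it):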

<env:Envelope xmlns:env='http://schemas.xmlsoap.org/soap/envelope/'><env:Header></env:Header><env:Body><prod:getProductsResponse xmlns:prod='https://product.individual.ns.listinsgapi.aa.com'><return><ackCode>success</ackCode><responseTime>13/09/2021 09:47:34</responseTime><timeElapsed>211 ms</timeElapsed><productCount>199</productCount><products><product><productId>01201801947</productId><product><categoryCode>cn1g</categoryCode><storeCategoryId>0</storeCategoryId><title>Morphy Richards Sensörlü çöp kutusu, 30 litre, yuvarlak, siyah paslanmaz çelik</title><specs><spec required="false" value="Standart Çöp Kovası" name="Ürün Tipi"/><spec required="false" value="Montajsız" name="Montaj Tipi"/><spec required="false" value="Sensörlü Kapak" name="Kapak Tipi"/><spec required="false" value="26 lt-30 lt" name="İç Hacim"/><spec required="false" value="Çelik" name="Malzeme"/><spec required="false" value="Sıfır" name="Durum"/></specs><photos><photo photoId="0"><url>https://mcdn301.gi1ttigidliyor.net/622080/620801947_0.jpg</url></photo><photo photoId="1"><url>https://mcdn011.gittigidliyor.net/620380/62081101947_1.jpg</url></photo><photo photoId="2"><url>https://mcdn021.gittigidliyor.net/620180/6210801947_2.jpg</url></photo><photo photoId="3"><url>https://mcdn201.gittigidliyor.net/620850/6208013947_3.jpg</url></photo><photo photoId="4"><url>https://mcdn301.gittigidliyor.net/623080/6208101947_4.jpg</url></photo><photo photoId="5"><url>https://mcdn01.gittigidiyor.net/62080/620801947_5.jpg</url></photo><photo photoId="6"><url>https://mcdn01.gittigidiyor.net/62080/620801947_6.jpg</url></photo></photos><pageTemplate>4</pageTemplate><description>&lt;body&gt;
 &lt;ul class=&quot;a-unordered-list a-vertical a-spacing-mini&quot; style=&quot;padding-right: 0px; padding-left: 0px; box-sizing: border-box; margin: 0px 0px 0px 18px; color: rgb(17, 17, 17); font-family: &quot;&gt; 
  &lt;li style=&quot;box-sizing: border-box; list-style: disc; overflow-wrap: break-word; margin: 0px;&quot;&gt;&amp;nbsp; &lt;h2 style=&quot;box-sizing: border-box; padding: 0px 0px 4px; margin: 3px 0px 7px; text-rendering: optimizelegibility; line-height: 32px; font-family: &quot;&gt;Ürün Bilgileri&lt;/h2&gt; &lt;span style=&quot;background-color:rgb(255, 255, 255); box-sizing:border-box; color:rgb(15, 17, 17); font-family:amazon ember,arial,sans-serif; font-size:14px&quot;&gt;Renk:&lt;strong style=&quot;box-sizing:border-box; font-weight:700&quot;&gt;Paslanmaz Çelik&lt;/strong&gt;&lt;/span&gt; 
   &lt;div class=&quot;a-row a-spacing-top-base&quot; style=&quot;box-sizing: border-box; width: 1213px; color: rgb(15, 17, 17); font-family: &quot;&gt; 
    &lt;div class=&quot;a-column a-span6&quot; style=&quot;box-sizing: border-box; margin-right: 24.25px; float: left; min-height: 1px; overflow: visible; width: 593.734px;&quot;&gt; 
     &lt;div class=&quot;a-row a-spacing-base&quot; style=&quot;box-sizing: border-box; width: 593.734px; margin-bottom: 12px !important;&quot;&gt; 
      &lt;div class=&quot;a-row a-expander-container a-expander-extend-container&quot; style=&quot;box-sizing: border-box; width: 593.734px;&quot;&gt; 
       &lt;div class=&quot;a-row&quot; style=&quot;box-sizing: border-box; width: 593.734px;&quot;&gt;

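I only want to apply HTML decoding to the description part of the data, because for some products the HTML tags in the description are not fully parsed in the incoming data, which causes errors.

For example:

0979c08d37cd.CR0,0,2000,2000_PT0_SX220_.jpg style=-webkit-tap-highlight-color: transparent; border: none; box-sizing: border-box; display: block; margin: 0px auto; max-width: 100%; padding: 0px; vertical-align: top/p /div /th /tr /tbody /table /div /div /div /div /div /div /div /div /div /body

To solve this, I tried two different approaches.

Before going into the approaches:

context: the reply context; its reply attribute holds the raw text. context.reply = the incoming data, type(context.reply) = bytes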

class UnicodeFilter(MessagePlugin):

    def received(self, context):
        from lxml import etree
        from io import BytesIO

        parser = etree.XMLParser(recover=True)
        request_string = context.reply.decode("utf-8")
        replaced_string = request_string.replace("&gt;", ">").replace("&lt;", "<")

        byte_rep_string = str.encode(replaced_string)
      
        doc = etree.parse(BytesIO(byte_rep_string), parser)
        byte_str_doc = etree.tostring(doc)
        context.reply = byte_str_doc

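This approach did not work. It raised no error, but the HTML tags in the description body were unchanged, and the HTML tags in the product description are still broken.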

class UnicodeFilter(MessagePlugin):

    def received(self, context):
        from lxml import etree
        from io import BytesIO
        import html

        parser = etree.XMLParser(recover=True) # Initialize the parser
        request_string = context.reply.decode("utf-8") # Converting incoming data byte to string
        html_decoded = html.unescape(request_string) # Html decoding to the data
        byte_rep_string = str.encode(html_decoded) # Converting the data from string to byte
      
        doc = etree.parse(BytesIO(byte_rep_string), parser)
        byte_str_doc = etree.tostring(doc)
        context.reply = byte_str_doc

With this approach I get a TypeNotFound: Type not found: 'body' error.
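The sketch below illustrates what html.unescape() does when it is applied to the whole envelope rather than to the description alone. The shortened envelope string is a made-up stand-in for the real reply. Once the entities are unescaped before XML parsing, the markup inside the description turns into real child elements. A plausible explanation for the TypeNotFound error, offered here as an assumption and not verified against the asker's WSDL, is that suds then tries to resolve the new body element against the service schema.

```python
import html
from lxml import etree

# Hypothetical, shortened stand-in for the real SOAP reply.
envelope = (
    "<return><description>"
    "&lt;body&gt;&lt;p&gt;Paslanmaz Çelik&lt;/p&gt;&lt;/body&gt;"
    "</description></return>"
)

# Parsed as received: the description holds a single text node, no children.
as_received = etree.fromstring(envelope).find("description")
children_before = [child.tag for child in as_received]
text_before = as_received.text

# Unescaped as a whole first: the escaped markup becomes real XML elements.
unescaped = etree.fromstring(html.unescape(envelope)).find("description")
children_after = [child.tag for child in unescaped]

print(children_before, children_after)
```

Note that the XML parser already resolves the entities when the text is read, so text_before is the decoded HTML string without any extra unescaping step.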

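To summarize what I want to do: first, I want to parse the incoming data with the lxml library, because some characters in the data can cause problems and I was getting a "not well-formed (invalid token)" error (I have solved this part). Second, I want to HTML-decode only the description part of this data (to fix the HTML tag problem).

Any help would be great.

[Comments]:

Please provide a proper minimal reproducible example. Is a class that inherits from MessagePlugin really needed to reproduce the problem? The "incoming data" sample is not well-formed (end tags are missing).

@mzjn Actually, that is the error message I got at first, but I solved it by using a filter (the suds MessagePlugin). The incoming data shown is not the complete data; it is a sample of the original, which is a huge file. So my question is about the HTML decoding part (you can see the HTML section under the description tag in the incoming data sample). When I use html.unescape(request_string), it raises an error for some reason (TypeNotFound: Type not found: 'body'). The question is how to parse the incoming data and its inner description tag separately.

The problem is hard to reproduce, but I have tried to give a more detailed explanation. Basically, you receive some XML data like schemas.xmlsoap.org/soap/envelope/…>, and that data contains a description section whose tag holds an HTML body with HTML tags. So I want to parse the XML data and the inner description part separately. I use parser = etree.XMLParser(recover=True) to parse the incoming XML data and html_decoded = html.unescape(request_string) to parse the HTML part.

[Answer 1]: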
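A minimal sketch of the two goals stated above, under some assumptions: the description elements carry no namespace (as in the sample), and the helper name description_trees is hypothetical. It parses the reply bytes with a recovering XML parser, then parses only the text of each description element as HTML, leaving the rest of the envelope untouched.

```python
from lxml import etree


def description_trees(reply_bytes):
    """Return one parsed HTML tree per non-empty <description> element.

    Hypothetical helper: the envelope itself is parsed as XML with
    recover=True; only the description text is handed to the HTML parser.
    """
    xml_parser = etree.XMLParser(recover=True)
    root = etree.fromstring(reply_bytes, xml_parser)
    if root is None:  # nothing could be recovered from the input
        return []
    trees = []
    for node in root.iter("description"):
        # node.text is already entity-decoded by the XML parser.
        if node.text and node.text.strip():
            trees.append(etree.fromstring(node.text, etree.HTMLParser()))
    return trees


# Made-up sample reply, much shorter than the real one.
sample = (
    "<return><products><product><description>"
    "&lt;body&gt;&lt;p&gt;Renk: &lt;strong&gt;Paslanmaz Çelik"
    "&lt;/strong&gt;&lt;/p&gt;&lt;/body&gt;"
    "</description></product></products></return>"
).encode("utf-8")

texts = ["".join(tree.itertext()) for tree in description_trees(sample)]
print(texts)
```

Such a helper could be called with context.reply from inside received(), or on the raw reply elsewhere. Because it does not write unescaped markup back into context.reply, the structure that suds later unmarshals stays as the service sent it.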

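I am not sure I can reproduce your specific error, but once you have the string from the request, I would use the following approach with etree.fromstring(). (I cleaned up the test data and closed its tags so that it can be parsed to demonstrate the solution. The data also contained an extra <product> tag that prevented parsing, which you may have to deal with.)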


In [104]: import lxml.etree

In [105]: string = '''<env:Envelope xmlns:env='http://schemas.xmlsoap.org/soap/envelope/'><env:Header></env:Header><env:Body><pr
     ...: od:getProductsResponse xmlns:prod='https://product.individual.ns.listinsgapi.aa.com'><return><ackCode>success</ackCode
     ...: ><responseTime>13/09/2021 09:47:34</responseTime><timeElapsed>211 ms</timeElapsed><productCount>199</productCount><pro
     ...: ducts><product><productId>01201801947</productId><categoryCode>cn1g</categoryCode><storeCategoryId>0</storeCategoryId>
     ...: <title>Morphy Richards Sensörlü çöp kutusu, 30 litre, yuvarlak, siyah paslanmaz çelik</title><specs><spec required="fa
     ...: lse" value="Standart Çöp Kovası" name="Ürün Tipi"/><spec required="false" value="Montajsız" name="Montaj Tipi"/><spec 
     ...: required="false" value="Sensörlü Kapak" name="Kapak Tipi"/><spec required="false" value="26 lt-30 lt" name="İç Hacim"/
     ...: ><spec required="false" value="Çelik" name="Malzeme"/><spec required="false" value="Sıfır" name="Durum"/></specs><phot
     ...: os><photo photoId="0"><url>https://mcdn301.gi1ttigidliyor.net/622080/620801947_0.jpg</url></photo><photo photoId="1"><
     ...: url>https://mcdn011.gittigidliyor.net/620380/62081101947_1.jpg</url></photo><photo photoId="2"><url>https://mcdn021.gi
     ...: ttigidliyor.net/620180/6210801947_2.jpg</url></photo><photo photoId="3"><url>https://mcdn201.gittigidliyor.net/620850/
     ...: 6208013947_3.jpg</url></photo><photo photoId="4"><url>https://mcdn301.gittigidliyor.net/623080/6208101947_4.jpg</url><
     ...: /photo><photo photoId="5"><url>https://mcdn01.gittigidiyor.net/62080/620801947_5.jpg</url></photo><photo photoId="6"><
     ...: url>https://mcdn01.gittigidiyor.net/62080/620801947_6.jpg</url></photo></photos><pageTemplate>4</pageTemplate><descrip
     ...: tion>&lt;body&gt;
     ...:  &lt;ul class=&quot;a-unordered-list a-vertical a-spacing-mini&quot; style=&quot;padding-right: 0px; padding-left: 0px
     ...: ; box-sizing: border-box; margin: 0px 0px 0px 18px; color: rgb(17, 17, 17); font-family: &quot;&gt; 
     ...:   &lt;li style=&quot;box-sizing: border-box; list-style: disc; overflow-wrap: break-word; margin: 0px;&quot;&gt;&amp;n
     ...: bsp; &lt;h2 style=&quot;box-sizing: border-box; padding: 0px 0px 4px; margin: 3px 0px 7px; text-rendering: optimizeleg
     ...: ibility; line-height: 32px; font-family: &quot;&gt;Ürün Bilgileri&lt;/h2&gt; &lt;span style=&quot;background-color:rgb
     ...: (255, 255, 255); box-sizing:border-box; color:rgb(15, 17, 17); font-family:amazon ember,arial,sans-serif; font-size:14
     ...: px&quot;&gt;Renk:&lt;strong style=&quot;box-sizing:border-box; font-weight:700&quot;&gt;Paslanmaz Çelik&lt;/strong&gt;
     ...: &lt;/span&gt; 
     ...:    &lt;div class=&quot;a-row a-spacing-top-base&quot; style=&quot;box-sizing: border-box; width: 1213px; color: rgb(15
     ...: , 17, 17); font-family: &quot;&gt; 
     ...:     &lt;div class=&quot;a-column a-span6&quot; style=&quot;box-sizing: border-box; margin-right: 24.25px; float: left;
     ...:  min-height: 1px; overflow: visible; width: 593.734px;&quot;&gt; 
     ...:      &lt;div class=&quot;a-row a-spacing-base&quot; style=&quot;box-sizing: border-box; width: 593.734px; margin-botto
     ...: m: 12px !important;&quot;&gt; 
     ...:       &lt;div class=&quot;a-row a-expander-container a-expander-extend-container&quot; style=&quot;box-sizing: border-
     ...: box; width: 593.734px;&quot;&gt; 
     ...:        &lt;div class=&quot;a-row&quot; style=&quot;box-sizing: border-box; width: 593.734px;&quot;&gt;
     ...: </description>
     ...: </product>
     ...: </products>
     ...: </return>
     ...: </prod:getProductsResponse>
     ...: </env:Body>
     ...: </env:Envelope>'''

In [106]: root = lxml.etree.fromstring(string)

In [108]: descriptions = root.xpath('//description')

In [109]: description = descriptions[0]

In [110]: description.text
Out[110]: '<body>\n <ul class="a-unordered-list a-vertical a-spacing-mini" style="padding-right: 0px; padding-left: 0px; box-sizing: border-box; margin: 0px 0px 0px 18px; color: rgb(17, 17, 17); font-family: "> \n  <li style="box-sizing: border-box; list-style: disc; overflow-wrap: break-word; margin: 0px;">&nbsp; <h2 style="box-sizing: border-box; padding: 0px 0px 4px; margin: 3px 0px 7px; text-rendering: optimizelegibility; line-height: 32px; font-family: ">Ürün Bilgileri</h2> <span style="background-color:rgb(255, 255, 255); box-sizing:border-box; color:rgb(15, 17, 17); font-family:amazon ember,arial,sans-serif; font-size:14px">Renk:<strong style="box-sizing:border-box; font-weight:700">Paslanmaz Çelik</strong></span> \n   <div class="a-row a-spacing-top-base" style="box-sizing: border-box; width: 1213px; color: rgb(15, 17, 17); font-family: "> \n    <div class="a-column a-span6" style="box-sizing: border-box; margin-right: 24.25px; float: left; min-height: 1px; overflow: visible; width: 593.734px;"> \n     <div class="a-row a-spacing-base" style="box-sizing: border-box; width: 593.734px; margin-bottom: 12px !important;"> \n      <div class="a-row a-expander-container a-expander-extend-container" style="box-sizing: border-box; width: 593.734px;"> \n       <div class="a-row" style="box-sizing: border-box; width: 593.734px;">\n'

In [112]: html_root = lxml.etree.fromstring(description.text, lxml.etree.HTMLParser())

In [114]: lxml.etree.tostring(html_root)
Out[114]: b'<html><body>\n <ul class="a-unordered-list a-vertical a-spacing-mini" style="padding-right: 0px; padding-left: 0px; box-sizing: border-box; margin: 0px 0px 0px 18px; color: rgb(17, 17, 17); font-family: "> \n  <li style="box-sizing: border-box; list-style: disc; overflow-wrap: break-word; margin: 0px;">&#160; <h2 style="box-sizing: border-box; padding: 0px 0px 4px; margin: 3px 0px 7px; text-rendering: optimizelegibility; line-height: 32px; font-family: ">&#220;r&#252;n Bilgileri</h2> <span style="background-color:rgb(255, 255, 255); box-sizing:border-box; color:rgb(15, 17, 17); font-family:amazon ember,arial,sans-serif; font-size:14px">Renk:<strong style="box-sizing:border-box; font-weight:700">Paslanmaz &#199;elik</strong></span> \n   <div class="a-row a-spacing-top-base" style="box-sizing: border-box; width: 1213px; color: rgb(15, 17, 17); font-family: "> \n    <div class="a-column a-span6" style="box-sizing: border-box; margin-right: 24.25px; float: left; min-height: 1px; overflow: visible; width: 593.734px;"> \n     <div class="a-row a-spacing-base" style="box-sizing: border-box; width: 593.734px; margin-bottom: 12px !important;"> \n      <div class="a-row a-expander-container a-expander-extend-container" style="box-sizing: border-box; width: 593.734px;"> \n       <div class="a-row" style="box-sizing: border-box; width: 593.734px;">\n</div></div></div></div></div></li></ul></body></html>'

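If you need to manipulate the HTML after this, it is better to work on html_root than to manipulate the string. If that is the case, I can extend the answer as needed.

[Discussion]:

First of all, thank you for the reply. At the root = etree.fromstring(string_context_reply) step I get an error: lxml.etree.XMLSyntaxError: PCDATA invalid Char value. The reason may be that your "string" value is a bytes object in my case. I convert it to a string object with string_context_reply = context.reply.decode("UTF-8") and then use root = etree.fromstring(string_context_reply).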
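As an illustration of working on the parsed tree instead of the string, here is a small sketch. The fragment is a shortened, made-up version of the description, and the specific operations (dropping inline styles, collecting headings, extracting plain text) are only examples of what one might want to do.

```python
from lxml import etree

# Shortened, hypothetical description fragment with unclosed tags.
fragment = (
    '<body><ul style="margin: 0px"><li>'
    '<h2 style="line-height: 32px">Ürün Bilgileri</h2>'
    "<span>Renk:<strong>Paslanmaz Çelik</strong></span>"
)
html_root = etree.fromstring(fragment, etree.HTMLParser())

# Remove the inline style attribute from every element that has one.
for element in html_root.iter():
    element.attrib.pop("style", None)

# Collect heading texts and a whitespace-normalized plain-text version.
headings = [h2.text for h2 in html_root.iter("h2")]
plain_text = " ".join(" ".join(html_root.itertext()).split())

# Serialize back to HTML; the parser has closed the open tags.
cleaned = etree.tostring(html_root, encoding="unicode", method="html")
print(cleaned)
```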
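The "PCDATA invalid Char value" error usually points at characters that XML 1.0 does not allow, such as most control characters, and not at the bytes/str distinction. One possible workaround, sketched here as an assumption and not as the answer author's method, is to strip those characters before parsing. The helper name parse_reply and the sample data are hypothetical. Passing the question's XMLParser(recover=True) as the second argument of etree.fromstring() is another option.

```python
import re
from lxml import etree

# Everything outside the character ranges permitted by XML 1.0.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def parse_reply(reply_bytes):
    """Hypothetical helper: drop disallowed characters, then parse strictly."""
    text = reply_bytes.decode("utf-8")
    cleaned = _INVALID_XML_CHARS.sub("", text)
    # Re-encode so that a reply with an XML encoding declaration still parses.
    return etree.fromstring(cleaned.encode("utf-8"))


# Made-up reply containing a vertical-tab control character (\x0b).
dirty = "<description>Paslanmaz\x0b Çelik</description>".encode("utf-8")

try:
    etree.fromstring(dirty)
    strict_parse_failed = False
except etree.XMLSyntaxError:
    strict_parse_failed = True

recovered_text = parse_reply(dirty).text
print(strict_parse_failed, recovered_text)
```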
