How can I exclude a specific tag (<br>) from response.xpath?

Posted 2023-02-21

【Published】: 2021-09-03 00:34:20

【Question】:

Below is some sample source HTML. I want to extract its text as a single string (or as a list).

<font class="news">
    <table border="0" cellspacing="0" cellpadding="0" align="right">
        <tr>
            <td style="padding-left:10px; padding-bottom:5px;">
                <a href="../1.jpg" target="_blank" onfocus='this.blur()'>
                    <img src="../pic1/small_16239927831.jpg"  >
                </a>
            </td>
        </tr>
    </table>
    AAA<br><br>
    BBB<br><br>
    CCC<br>
</font>

I can get some results with

response.xpath('//font[@class="news"]/text()')

or

response.xpath('//font[@class="news"]/text()').extract()

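However, the result contains many \n or \n\t entries, and I only want "AAA BBB CCC" or ['AAA', 'BBB', 'CCC'].

I also tried normalize-space(), but it did not work. How can I get rid of these newlines and tabs?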

['AAA', '\n\t\t', '\n\n\t\t', 'BBB', '\n\t\t', 'CCC', '\n\t']

【Comments】:

Your question is not well formatted. normalize-space() should do the job. Can you share the source?

【Answer 1】:

This XPath:

normalize-space(//font[@class='news'])

gives this result:

AAA BBB CCC

【Comments】:

Does this answer your question?
