Sanitize HTML and close incomplete tags

Posted 2023-02-19

【Title】: Sanitize HTML and close incomplete tags 【Posted】: 2012-04-13 13:40:06 【Question】:

sanitize() in ApplicationHelper does not close tags.

s = "<a href='http://example.com'>incomplete"
sanitize(s, :tags => ['a', 'p'])

The snippet above leaves the string unchanged. How can I force it to append a closing </a>, or at least strip the <a> entirely?

【Answer 1】:

You can do this with a proper HTML parser. I recommend Nokogiri for the job:

require 'nokogiri'
# ...
s = "<a href='http://example.com'>incomplete"
Nokogiri::HTML::fragment(sanitize(s, :tags => ['a', 'p'])).to_xml
# => "<a href=\"http://example.com\">incomplete</a>"

This will always return well-formed XML. You can of course wrap it in your own helper method for convenience.

【Comments】:

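Thanks, but I get TypeError: can't convert Symbol into Integer in response to this. It works for plain text. This is Nokogiri 1.5.2.

@mahemoff: Nokogiri::HTML::fragment("<a href='http://example.com'>incomplete").to_xml works fine here. What is the actual tag soup you are trying?

Actually, it looks like the problem was the second argument to sanitize. As in the original question, the allowed tags need to be passed in a hash, keyed by :tags =>. Nokogiri::HTML::fragment(sanitize('test <a href="http://example.com">incomplete', :tags => ['a', 'p'])).to_xml does work.

By the way, I just noticed that to_xml may be a better choice than to_html when dealing with i18n. The latter was escaping Unicode characters as entities.

【Answer 2】: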

The updated answer is:

 s = "<a href='http://example.com'>incomplete"
 html = sanitize(s, tags: %w[a p])
 Nokogiri::HTML::DocumentFragment.parse(html).to_html

【Comments】:

The last line worked well for me, and it also closed the unclosed tags.