Stripping style attributes with Nokogiri

Posted 2021-05-03


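I'm scraping an HTML page with Nokogiri and I want to strip out all style attributes. How can I do that? (I'm not using Rails, so I can't use its sanitize method, and I don't want to use the sanitize gem because I want to remove by blacklist rather than by whitelist.)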

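require 'nokogiri'
require 'open-uri'

# `url` is assumed to be defined elsewhere.
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer opens URLs.
html = URI.open(url)
doc = Nokogiri::HTML(html.read)
doc.css('.post').each do |post|
  puts post.to_s
end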

=> <p><span style="font-size: x-large">bla bla <a href="http://torrentfreak.com/netflix-is-killing-bittorrent-in-the-us-110427/">statistica</a> blabla</span></p>

I want it to be:

=> <p><span>bla bla <a href="http://torrentfreak.com/netflix-is-killing-bittorrent-in-the-us-110427/">statistica</a> blabla</span></p>

Answer

require 'nokogiri'

html = '<p class="post"><span style="font-size: x-large">bla bla</span></p>'
doc = Nokogiri::HTML(html)
doc.xpath('//@style').remove
puts doc.css('.post')
#=> <p class="post"><span>bla bla</span></p>

Edited to show that you can call NodeSet#remove directly instead of using .each(&:remove).

Note that if you have a DocumentFragment rather than a Document, Nokogiri has a longstanding bug where searching from a fragment does not work as you would expect. The workaround is to use:

doc.xpath('@style|.//@style').remove

Another answer

This works with both documents and document fragments:

doc = Nokogiri::HTML::DocumentFragment.parse(...)

or

doc = Nokogiri::HTML(...)

To remove all "style" attributes, you can do:

doc.css('*').remove_attr('style')

Another answer

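I tried Phrogz's answer but couldn't get it to work (I was using a document fragment, though I would have thought it should work the same?).

The "//" at the start didn't seem to check all the nodes as I expected. In the end I did something more long-winded, but it works. For the record, in case anyone else has the same trouble, here is my solution (dirty though it is):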

doc = Nokogiri::HTML::Document.new
body_dom = doc.fragment( my_html )

# strip out any attributes we don't want
body_dom.xpath( './/*[@align]|*[@align]' ).each do |tag|
    tag.attributes["align"].remove
end
