HTML Agility Pack: strip tags NOT IN whitelist
【Posted】: 2011-03-07 15:45:29
【Question】: I'm trying to create a function that removes HTML tags and attributes that are not in a whitelist. I have the following HTML:
<b>first text </b>
<b>second text here
<a>some text here</a>
<a>some text here</a>
</b>
<a>some twxt here</a>
I'm using HTML Agility Pack, and my code so far is:
static List<string> WhiteNodeList = new List<string> { "b" };
static List<string> WhiteAttrList = new List<string> { };
static HtmlNode htmlNode;
public static void RemoveNotInWhiteList(out string _output, HtmlNode pNode, List<string> pWhiteList, List<string> attrWhiteList)
{
    // remove all attributes not on white list
    foreach (var item in pNode.ChildNodes)
    {
        item.Attributes.Where(u => attrWhiteList.Contains(u.Name) == false).ToList().ForEach(u => RemoveAttribute(u));
    }
    // remove all html and their innerText and attributes if not on whitelist.
    //pNode.ChildNodes.Where(u => pWhiteList.Contains(u.Name) == false).ToList().ForEach(u => u.Remove());
    //pNode.ChildNodes.Where(u => pWhiteList.Contains(u.Name) == false).ToList().ForEach(u => u.ParentNode.ReplaceChild(ConvertHtmlToNode(u.InnerHtml),u));
    //pNode.ChildNodes.Where(u => pWhiteList.Contains(u.Name) == false).ToList().ForEach(u => u.Remove());
    for (int i = 0; i < pNode.ChildNodes.Count; i++)
    {
        if (!pWhiteList.Contains(pNode.ChildNodes[i].Name))
        {
            HtmlNode _newNode = ConvertHtmlToNode(pNode.ChildNodes[i].InnerHtml);
            pNode.ChildNodes[i].ParentNode.ReplaceChild(_newNode, pNode.ChildNodes[i]);
            if (pNode.ChildNodes[i].HasChildNodes && !string.IsNullOrEmpty(pNode.ChildNodes[i].InnerText.Trim().Replace("\r\n", "")))
            {
                HtmlNode outputNode1 = pNode.ChildNodes[i];
                for (int j = 0; j < pNode.ChildNodes[i].ChildNodes.Count; j++)
                {
                    string _childNodeOutput;
                    RemoveNotInWhiteList(out _childNodeOutput,
                        pNode.ChildNodes[i], WhiteNodeList, WhiteAttrList);
                    pNode.ChildNodes[i].ReplaceChild(ConvertHtmlToNode(_childNodeOutput), pNode.ChildNodes[i].ChildNodes[j]);
                    i++;
                }
            }
        }
    }
    // Console.WriteLine(pNode.OuterHtml);
    _output = pNode.OuterHtml;
}
private static void RemoveAttribute(HtmlAttribute u)
{
    u.Value = u.Value.ToLower().Replace("javascript", "");
    u.Remove();
}
public static HtmlNode ConvertHtmlToNode(string html)
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    if (doc.DocumentNode.ChildNodes.Count == 1)
        return doc.DocumentNode.ChildNodes[0];
    else return doc.DocumentNode;
}
The output I want to achieve is:
<b>first text </b>
<b>second text here
some text here
some text here
</b>
some twxt here
That means I only want to keep the <b> tags.
The reason I'm doing this is that some users copy and paste from MS Word into my WYSIWYG HTML editor.
Thanks!
【Answer 1】: Thanks for your code, it's great!
I made some optimizations...
class TagSanitizer
{
    List<HtmlNode> _deleteNodes = new List<HtmlNode>();
    public static void Sanitize(HtmlNode node)
    {
        new TagSanitizer().Clean(node);
    }
    void Clean(HtmlNode node)
    {
        CleanRecursive(node);
        for (int i = _deleteNodes.Count - 1; i >= 0; i--)
        {
            HtmlNode nodeToDelete = _deleteNodes[i];
            nodeToDelete.ParentNode.RemoveChild(nodeToDelete, true);
        }
    }
    void CleanRecursive(HtmlNode node)
    {
        if (node.NodeType == HtmlNodeType.Element)
        {
            if (Config.TagsWhiteList.ContainsKey(node.Name) == false)
            {
                _deleteNodes.Add(node);
            }
            else if (node.HasAttributes)
            {
                for (int i = node.Attributes.Count - 1; i >= 0; i--)
                {
                    HtmlAttribute currentAttribute = node.Attributes[i];
                    string[] allowedAttributes = Config.TagsWhiteList[node.Name];
                    if (allowedAttributes != null)
                    {
                        if (allowedAttributes.Contains(currentAttribute.Name) == false)
                            node.Attributes.Remove(currentAttribute);
                    }
                    else
                    {
                        node.Attributes.Remove(currentAttribute);
                    }
                }
            }
        }
        if (node.HasChildNodes)
            node.ChildNodes.ToList().ForEach(v => CleanRecursive(v));
    }
}
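The snippet above relies on a `Config.TagsWhiteList` that the answer does not show. Below is a minimal sketch of what it might look like, assuming it is a dictionary from tag name to allowed attribute names, which is what the `ContainsKey` call and the indexer in `CleanRecursive` suggest. The `Config` and `Demo` classes here are hypothetical stand-ins, not the answerer's actual code, and the sketch assumes `TagSanitizer` from the answer is compiled in the same project.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical stand-in for the Config class that TagSanitizer refers to.
static class Config
{
    // Tag name -> allowed attribute names; null means "strip every attribute".
    public static readonly Dictionary<string, string[]> TagsWhiteList =
        new Dictionary<string, string[]>
        {
            { "b", null },
            // { "a", new[] { "href" } },  // would keep <a> but only its href attribute
        };
}

static class Demo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<b>second text here <a href=\"#\">some text here</a></b>");

        // Sanitize edits the node tree in place, so read the markup back afterwards.
        TagSanitizer.Sanitize(doc.DocumentNode);
        System.Console.WriteLine(doc.DocumentNode.WriteContentTo());
    }
}
```

Because `Clean` calls `RemoveChild(nodeToDelete, true)`, a non-whitelisted element is unwrapped rather than dropped: its children are kept in its place, which is the behaviour the question asks for.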
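【Discussion】:
What is Config in this line? if (Config.TagsWhiteList.ContainsKey(node.Name) == false)
It's just another list; you can change it however you like :)
As a side note, when I tried this I ran into inconsistent output markup (parts out of order, not all formatting stripped correctly), which may have been caused by a multithreading optimization of the recursion.
Yes, this snippet does not support multitasking.
So far this answer works for me. The accepted answer kept throwing "Object reference not set to an instance of an object" in the StripHtml method on the server I deployed it to. It proved too hard to debug, because it did not raise the error in my local environment.
【Answer 2】: Hey, apparently I almost found the answer in someone's blog post....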
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
namespace Wayloop.Blog.Core.Markup
{
    public static class HtmlSanitizer
    {
        private static readonly IDictionary<string, string[]> Whitelist;
        static HtmlSanitizer()
        {
            Whitelist = new Dictionary<string, string[]>
            {
                { "a", new[] { "href" } },
                { "strong", null },
                { "em", null },
                { "blockquote", null },
            };
        }
        public static string Sanitize(string input)
        {
            var htmlDocument = new HtmlDocument();
            htmlDocument.LoadHtml(input);
            SanitizeNode(htmlDocument.DocumentNode);
            return htmlDocument.DocumentNode.WriteTo().Trim();
        }
        private static void SanitizeChildren(HtmlNode parentNode)
        {
            for (int i = parentNode.ChildNodes.Count - 1; i >= 0; i--)
            {
                SanitizeNode(parentNode.ChildNodes[i]);
            }
        }
        private static void SanitizeNode(HtmlNode node)
        {
            if (node.NodeType == HtmlNodeType.Element)
            {
                if (!Whitelist.ContainsKey(node.Name))
                {
                    // Removes the element together with everything inside it.
                    node.ParentNode.RemoveChild(node);
                    return;
                }
                if (node.HasAttributes)
                {
                    for (int i = node.Attributes.Count - 1; i >= 0; i--)
                    {
                        HtmlAttribute currentAttribute = node.Attributes[i];
                        string[] allowedAttributes = Whitelist[node.Name];
                        // Note: Contains throws when allowedAttributes is null (tags mapped to null above).
                        if (!allowedAttributes.Contains(currentAttribute.Name))
                        {
                            node.Attributes.Remove(currentAttribute);
                        }
                    }
                }
            }
            if (node.HasChildNodes)
            {
                SanitizeChildren(node);
            }
        }
    }
}
I got HtmlSanitizer from here. Apparently it does not strip the tags; it removes the elements altogether.
OK, here is the solution for anyone who needs it later.
public static class HtmlSanitizer
{
    private static readonly IDictionary<string, string[]> Whitelist;
    private static List<string> DeletableNodesXpath = new List<string>();
    static HtmlSanitizer()
    {
        Whitelist = new Dictionary<string, string[]>
        {
            { "a", new[] { "href" } },
            { "strong", null },
            { "em", null },
            { "blockquote", null },
            { "b", null },
            { "p", null },
            { "ul", null },
            { "ol", null },
            { "li", null },
            { "div", new[] { "align" } },
            { "strike", null },
            { "u", null },
            { "sub", null },
            { "sup", null },
            { "table", null },
            { "tr", null },
            { "td", null },
            { "th", null }
        };
    }
    public static string Sanitize(string input)
    {
        if (input.Trim().Length < 1)
            return string.Empty;
        var htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml(input);
        SanitizeNode(htmlDocument.DocumentNode);
        string xPath = HtmlSanitizer.CreateXPath();
        return StripHtml(htmlDocument.DocumentNode.WriteTo().Trim(), xPath);
    }
    private static void SanitizeChildren(HtmlNode parentNode)
    {
        for (int i = parentNode.ChildNodes.Count - 1; i >= 0; i--)
        {
            SanitizeNode(parentNode.ChildNodes[i]);
        }
    }
    private static void SanitizeNode(HtmlNode node)
    {
        if (node.NodeType == HtmlNodeType.Element)
        {
            if (!Whitelist.ContainsKey(node.Name))
            {
                if (!DeletableNodesXpath.Contains(node.Name))
                {
                    //DeletableNodesXpath.Add(node.Name.Replace("?",""));
                    node.Name = "removeableNode";
                    DeletableNodesXpath.Add(node.Name);
                }
                if (node.HasChildNodes)
                {
                    SanitizeChildren(node);
                }
                return;
            }
            if (node.HasAttributes)
            {
                for (int i = node.Attributes.Count - 1; i >= 0; i--)
                {
                    HtmlAttribute currentAttribute = node.Attributes[i];
                    string[] allowedAttributes = Whitelist[node.Name];
                    if (allowedAttributes != null)
                    {
                        if (!allowedAttributes.Contains(currentAttribute.Name))
                        {
                            node.Attributes.Remove(currentAttribute);
                        }
                    }
                    else
                    {
                        node.Attributes.Remove(currentAttribute);
                    }
                }
            }
        }
        if (node.HasChildNodes)
        {
            SanitizeChildren(node);
        }
    }
    private static string StripHtml(string html, string xPath)
    {
        HtmlDocument htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);
        if (xPath.Length > 0)
        {
            HtmlNodeCollection invalidNodes = htmlDoc.DocumentNode.SelectNodes(xPath);
            foreach (HtmlNode node in invalidNodes)
            {
                node.ParentNode.RemoveChild(node, true);
            }
        }
        return htmlDoc.DocumentNode.WriteContentTo();
    }
    private static string CreateXPath()
    {
        string _xPath = string.Empty;
        for (int i = 0; i < DeletableNodesXpath.Count; i++)
        {
            if (i != DeletableNodesXpath.Count - 1)
            {
                _xPath += string.Format("//{0}|", DeletableNodesXpath[i].ToString());
            }
            else _xPath += string.Format("//{0}", DeletableNodesXpath[i].ToString());
        }
        return _xPath;
    }
}
I renamed the nodes because, when I had to parse XML-namespaced nodes, it would crash during XPath parsing.
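【Discussion】:
The link to HtmlSanitizer is broken. This may be the code Meltdown was referring to: gist.github.com/814428
That is definitely not the code I based my whitelist validation class on. The original author did not use RegEx. The author's original code is the first block of code I posted.
This code does not work; I can easily save a form with a submit button, as well as script sections containing harmful code.
There are 2 similar projects, github.com/Vereyon/HtmlRuleSanitizer and github.com/mganss/HtmlSanitizer. The latter links to owasp.org/index.php/.NET_AntiXSS_Library by way of example.
Note that DeletableNodesXpath will keep growing with the code above. It always adds "removableNode" to the list and the check never matches (because it is looking in a list full of "removableNode").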
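One commenter in this thread reported "Object reference not set to an instance of an object" inside StripHtml. A plausible cause, though the thread does not confirm it, is that `SelectNodes` returns null rather than an empty collection when the XPath matches nothing (its default behaviour), so the `foreach` fails. The following is a minimal sketch of a null-safe variant. `StripHtmlSketch` and `Unwrap` are hypothetical names used for illustration, not part of the answer or of HTML Agility Pack.

```csharp
using HtmlAgilityPack;

public static class StripHtmlSketch
{
    // Hypothetical helper: unwraps every element matched by xPath,
    // keeping the element's children in its place.
    public static string Unwrap(string html, string xPath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        if (!string.IsNullOrEmpty(xPath))
        {
            // By default SelectNodes returns null, not an empty collection,
            // when nothing matches, so guard before iterating.
            HtmlNodeCollection matches = doc.DocumentNode.SelectNodes(xPath);
            if (matches != null)
            {
                foreach (HtmlNode node in matches)
                {
                    // true = keep the grandchildren, i.e. strip only the tag itself.
                    node.ParentNode.RemoveChild(node, true);
                }
            }
        }

        return doc.DocumentNode.WriteContentTo();
    }
}
```

Taking the XPath as a parameter also avoids relying on a static list shared between calls, which is the growth problem described in the last comment.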