Filtering and Sanitizing HTML Tags in Java

Posted by 骑着龙的羊



OWASP Java HTML Sanitizer is a simple, fast Java library used mainly to prevent XSS.

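Its advantages are:

  1. Easy to use. No cumbersome XML configuration is needed, only a small amount of code.

  2. Maintained by Mike Samuel (a Google engineer).

  3. Passes more than 95% of AntiSamy's unit tests.

  4. High performance and low memory consumption.

  5. About 4 times faster than AntiSamy in DOM mode.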

 

1. Add the dependency to the POM

        <!-- HTML tag filtering -->
        <dependency>
            <groupId>com.googlecode.owasp-java-html-sanitizer</groupId>
            <artifactId>owasp-java-html-sanitizer</artifactId>
            <version>r136</version>
        </dependency>

2. Utility class

import org.owasp.html.ElementPolicy;
import org.owasp.html.HtmlPolicyBuilder;
import org.owasp.html.PolicyFactory;

import java.util.List;

/**
 * @author : RandySun
 * @date : 2018-10-08  10:32
 * Comment :
 */
public class HtmlUtils {

    // Allowed tags
    private static final String[] allowedTags = {"h1", "h2", "h3", "h4", "h5", "h6",
            "span", "strong",
            "img", "video", "source",
            "blockquote", "p", "div",
            "ul", "ol", "li",
            "table", "thead", "caption", "tbody", "tr", "th", "td", "br",
            "a"
    };

    // Tags that need to be transformed
    private static final String[] needTransformTags = {"article", "aside", "command","datalist","details","figcaption", "figure",
            "footer","header", "hgroup","section","summary"};

    // Tags that carry links
    private static final String[] linkTags = {"img","video","source","a"};

    public static String sanitizeHtml(String htmlContent){
        PolicyFactory policy = new HtmlPolicyBuilder()
                // All allowed tags
                .allowElements(allowedTags)
                // Convert content tags to div
                .allowElements( new ElementPolicy() {
                    @Override
                    public String apply(String elementName, List<String> attributes){
                        return "div";
                    }
                },needTransformTags)
                // Attributes allowed on the link-carrying tags
                .allowAttributes("src","href","target").onElements(linkTags)
                // Only allow the https protocol in URLs
                .allowUrlProtocols("https")
                .toFactory();
        String safeHTML = policy.sanitize(htmlContent);
        return safeHTML;
    }

    public static void main(String[] args){
        String inputHtml = "<img src=\"https://a.jpb\"/>";
        System.out.println(sanitizeHtml(inputHtml));
    }
}
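
To get a feel for how a policy like the one in HtmlUtils treats different inputs, here is a minimal sketch that runs a few sample strings through a smaller policy. The class name SanitizerDemo, the constant POLICY, the method clean and the sample inputs are made up for illustration. The policy is built once and reused, on the assumption (based on the library documenting PolicyFactory as thread-safe and immutable) that it does not need to be rebuilt on every call. The exact output formatting may vary between library versions, so no expected output is shown.

```java
import org.owasp.html.ElementPolicy;
import org.owasp.html.HtmlPolicyBuilder;
import org.owasp.html.PolicyFactory;

import java.util.List;

public class SanitizerDemo {

    // Illustrative policy, built once and shared by all calls
    private static final PolicyFactory POLICY = new HtmlPolicyBuilder()
            // Tags that pass through unchanged
            .allowElements("p", "div", "a", "img")
            // Tags that are renamed to div
            .allowElements(new ElementPolicy() {
                @Override
                public String apply(String elementName, List<String> attributes) {
                    return "div";
                }
            }, "section", "article")
            // Attributes allowed on the link-carrying tags
            .allowAttributes("src", "href").onElements("a", "img")
            // Only https URLs are accepted in those attributes
            .allowUrlProtocols("https")
            .toFactory();

    public static String clean(String html) {
        return POLICY.sanitize(html);
    }

    public static void main(String[] args) {
        String[] samples = {
                "<p>ok</p><script>alert(1)</script>",         // script is not on the allowlist
                "<section>wrapped</section>",                  // section should come out as div
                "<a href=\"javascript:alert(1)\">click</a>",   // non-https href should be dropped
                "<img src=\"https://a.example/pic.png\"/>"     // https src should be kept
        };
        for (String sample : samples) {
            System.out.println(sample + "  ->  " + clean(sample));
        }
    }
}
```

Running main prints each input next to its sanitized form, which is a quick way to check a policy against the tags and attributes an application actually needs before wiring it into request handling.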

In HtmlUtils, .allowElements(allowedTags) adds all of the allowed HTML tags.

The following part handles the tags that need to be transformed: every tag listed in needTransformTags is converted to a div.
// Convert content tags to div
.allowElements( new ElementPolicy() {
    @Override
    public String apply(String elementName, List<String> attributes){
        return "div";
    }
},needTransformTags)

 

.allowAttributes("src","href","target").onElements(linkTags) specifies the attributes that are allowed on those particular tags.

.allowUrlProtocols("https") means that only the https protocol is allowed in href or src links.
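
The protocol rule can be pictured as a scheme allowlist check. The sketch below uses only the JDK and is not the library's implementation. The class UrlSchemeCheck and the method isAllowed are hypothetical names, and rejecting URLs that have no scheme (such as relative paths) is a choice made for this sketch, which may differ from how the sanitizer treats them.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Collections;
import java.util.Locale;
import java.util.Set;

public class UrlSchemeCheck {

    // Hypothetical allowlist mirroring .allowUrlProtocols("https")
    private static final Set<String> ALLOWED_SCHEMES = Collections.singleton("https");

    public static boolean isAllowed(String url) {
        try {
            String scheme = new URI(url.trim()).getScheme();
            // Sketch-only choice: a URL without a scheme is rejected
            return scheme != null
                    && ALLOWED_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            // Anything that does not parse as a URI is rejected
            return false;
        }
    }

    public static void main(String[] args) {
        String[] urls = {
                "https://a.example/pic.png",
                "http://a.example/pic.png",
                "javascript:alert(1)",
                "/images/a.png"
        };
        for (String url : urls) {
            System.out.println(url + " allowed: " + isAllowed(url));
        }
    }
}
```

An allowlist of this kind accepts only the schemes that are named explicitly, so unknown or script-bearing schemes are refused by default rather than having to be enumerated.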


 











