XML escaping methods in Java and the handling of Chinese characters

Posted 2022-12-09 秋风兮月


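The simplest and most convenient way to escape XML is to call the escapeXml method of StringEscapeUtils from the Apache commons-lang jar. However, this method behaves differently in commons lang 2.x and commons lang 3.x.

In commons lang 2.x, the escapeXml method of StringEscapeUtils escapes the characters ", &, <, > and ' in XML, and it also escapes every character whose Unicode value is greater than 0x7F.

StringEscapeUtils relies on an XML Entities object (Entities.XML). The characters defined in BASIC_ARRAY and APOS_ARRAY are added to this object, and any of them is escaped when it is encountered.

BASIC_ARRAY is defined as: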

private static final String[][] BASIC_ARRAY = {
        {"quot", "34"}, // " - double-quote
        {"amp", "38"}, // & - ampersand
        {"lt", "60"}, // < - less-than
        {"gt", "62"}, // > - greater-than
    };

APOS_ARRAY is defined as:

private static final String[][] APOS_ARRAY = {
        {"apos", "39"}, // XML apostrophe
    };

These are therefore the characters that get escaped. The escapeXml method delegates the actual work to Entities.XML.escape:

public void escape(Writer writer, String str) throws IOException {
    int len = str.length();
    for (int i = 0; i < len; i++) {
        char c = str.charAt(i);
        String entityName = this.entityName(c);
        if (entityName == null) {
            if (c > 0x7F) {
                writer.write("&#");
                writer.write(Integer.toString(c, 10));
                writer.write(';');
            } else {
                writer.write(c);
            }
        } else {
            writer.write('&');
            writer.write(entityName);
            writer.write(';');
        }
    }
}

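As the code shows, characters whose Unicode value is greater than 0x7F are escaped as well, so Chinese characters are also escaped when this method is used.

If you do not want Chinese characters to be escaped, you can either adapt the code above yourself and drop the escaping of characters greater than 0x7F, or use the escapeXml family of methods in commons lang3. In commons lang3 these methods were redesigned around the strategy pattern. The relevant methods are escapeXml, escapeXml10 and escapeXml11.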
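The first option (rewriting the loop yourself) could look roughly like the sketch below. It uses only the JDK; the class name SimpleXmlEscaper and the escapeNonAscii flag are made up for illustration and are not part of commons-lang. With the flag set to true it mimics the lang 2.x loop shown earlier, with false it leaves Chinese characters alone.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical helper for illustration; not a commons-lang class.
final class SimpleXmlEscaper {

    // Entity name for the five XML special characters, or null for anything else.
    static String entityName(char c) {
        switch (c) {
            case '"':  return "quot";
            case '&':  return "amp";
            case '<':  return "lt";
            case '>':  return "gt";
            case '\'': return "apos";
            default:   return null;
        }
    }

    // escapeNonAscii = true: chars above 0x7F become decimal numeric entities.
    // escapeNonAscii = false: chars above 0x7F are written unchanged.
    static void escape(Writer writer, String str, boolean escapeNonAscii) throws IOException {
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            String name = entityName(c);
            if (name != null) {
                writer.write('&');
                writer.write(name);
                writer.write(';');
            } else if (escapeNonAscii && c > 0x7F) {
                writer.write("&#");
                writer.write(Integer.toString(c, 10));
                writer.write(';');
            } else {
                writer.write(c);
            }
        }
    }

    // Convenience overload that returns a String (null in, null out).
    static String escape(String str, boolean escapeNonAscii) {
        if (str == null) {
            return null;
        }
        StringWriter writer = new StringWriter(str.length() * 2);
        try {
            escape(writer, str, escapeNonAscii);
        } catch (IOException e) {
            throw new IllegalStateException(e); // a StringWriter does not throw
        }
        return writer.toString();
    }
}
```

Calling SimpleXmlEscaper.escape(text, false) escapes only the five special characters. Note that this sketch does nothing about control characters or unpaired surrogates, which is what the lang3 methods described next take care of.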

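Of these, escapeXml has been deprecated. It escapes only the five characters ", &, <, > and ' in XML. It registers two Translators on ESCAPE_XML: new LookupTranslator(EntityArrays.BASIC_ESCAPE()) and new LookupTranslator(EntityArrays.APOS_ESCAPE()).

Besides escaping those five characters, escapeXml10 replaces a number of control characters, such as \u0000 and \b, with the empty string, because XML 1.0 is a plain-text format that cannot represent them. (\t, \n and \r are valid in XML 1.0 and are kept.) Unpaired surrogates cannot be represented either, so they are removed as well. The Translators registered for escapeXml10 therefore include, in addition to new LookupTranslator(EntityArrays.BASIC_ESCAPE()) and new LookupTranslator(EntityArrays.APOS_ESCAPE()):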

new LookupTranslator(
        new String[][] {
                { "\u0000", "" }, { "\u0001", "" }, { "\u0002", "" }, { "\u0003", "" }, { "\u0004", "" }, { "\u0005", "" }, { "\u0006", "" }, { "\u0007", "" }, { "\u0008", "" },
                { "\u000b", "" }, { "\u000c", "" }, { "\u000e", "" }, { "\u000f", "" }, { "\u0010", "" }, { "\u0011", "" }, { "\u0012", "" }, { "\u0013", "" }, { "\u0014", "" },
                { "\u0015", "" }, { "\u0016", "" }, { "\u0017", "" }, { "\u0018", "" }, { "\u0019", "" }, { "\u001a", "" }, { "\u001b", "" }, { "\u001c", "" }, { "\u001d", "" },
                { "\u001e", "" }, { "\u001f", "" }, { "\ufffe", "" }, { "\uffff", "" }
        }),
and
new UnicodeUnpairedSurrogateRemover().

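The first handles control characters; the second handles unpaired surrogates by removing unpaired code units in [#xD800, #xDFFF]. In other words, escapeXml10 removes every code point that falls outside the following ranges:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

In addition, escapeXml10 registers two more Translators, NumericEntityEscaper.between(0x7f, 0x84) and NumericEntityEscaper.between(0x86, 0x9f), which escape characters in the ranges [#x7F-#x84] | [#x86-#x9F].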
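To make the XML 1.0 character range listed above concrete, here is a small JDK-only sketch that checks a code point against it and strips everything outside it. The class and method names are invented for this example, and it covers only the removal step, not the numeric escaping of [#x7F-#x84] | [#x86-#x9F].

```java
// Hypothetical helper for illustration; not a commons-lang class.
final class Xml10CharFilter {

    // True if the code point lies inside the XML 1.0 ranges quoted in the text.
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Drops every code point outside the valid ranges. An unpaired surrogate
    // is read as a code point in [0xD800, 0xDFFF] and is dropped as well,
    // while a well-formed surrogate pair is read as one supplementary code point.
    static String stripInvalid(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (isValidXml10(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

Iterating by code point rather than by char is what lets the same loop keep supplementary characters and discard lone surrogates.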

For escapeXml11, XML 1.1 can represent certain control characters, so its Translator for control characters differs from the one used by escapeXml10:

new LookupTranslator(
    new String[][] {
            { "\u0000", "" },
            { "\u000b", "&#11;" }, // escaped rather than removed, matching the [#xB-#xC] range
            { "\u000c", "&#12;" },
            { "\ufffe", "" },
            { "\uffff", "" }
    })

escapeXml11 removes every code point that falls outside the following ranges:

[#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

escapeXml11 also registers

NumericEntityEscaper.between(0x1, 0x8),
NumericEntityEscaper.between(0xe, 0x1f),
NumericEntityEscaper.between(0x7f, 0x84),
NumericEntityEscaper.between(0x86, 0x9f),

four Translators, so that, together with the lookup table above, code points in the ranges [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F] are escaped.
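What "escaped" means for such a range is that each code point inside it is written as a decimal numeric character reference. The sketch below imitates that idea with the JDK only; RangeNumericEscaper is a made-up name and is not the commons-lang NumericEntityEscaper, although its between factory is modelled on the calls listed above.

```java
// Hypothetical stand-in for illustration; not the commons-lang class.
final class RangeNumericEscaper {
    private final int lo;
    private final int hi;

    private RangeNumericEscaper(int lo, int hi) {
        this.lo = lo;
        this.hi = hi;
    }

    // Escapes code points in the inclusive range [lo, hi].
    static RangeNumericEscaper between(int lo, int hi) {
        return new RangeNumericEscaper(lo, hi);
    }

    // Writes code points inside the range as &#<decimal>; and copies the rest.
    String escape(String input) {
        StringBuilder sb = new StringBuilder(input.length() * 2);
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (cp >= lo && cp <= hi) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

For instance, RangeNumericEscaper.between(0x1, 0x8) leaves a tab (#x9) untouched, because the tab lies just outside that range.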

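These three are the main methods in use. The following describes roughly how they work.

Each of the three methods uses a different Translator, but all of them are instances of the AggregateTranslator class. As the name suggests, this is an aggregate Translator whose job is to invoke the group of Translators registered in it. Every Translator extends the abstract class CharSequenceTranslator, and the escape methods all directly call the following CharSequenceTranslator method: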

    /**
     * Helper for non-Writer usage.
     * @param input CharSequence to be translated
     * @return String output of translation
     */
    public final String translate(final CharSequence input) {
        if (input == null) {
            return null;
        }
        try {
            final StringWriter writer = new StringWriter(input.length() * 2);
            translate(input, writer);
            return writer.toString();
        } catch (final IOException ioe) {
            // this should never ever happen while writing to a StringWriter
            throw new RuntimeException(ioe);
        }
    }

This method in turn calls:

    /**
     * Translate an input onto a Writer. This is intentionally final as its algorithm is
     * tightly coupled with the abstract method of this class.
     *
     * @param input CharSequence that is being translated
     * @param out Writer to translate the text to
     * @throws IOException if and only if the Writer produces an IOException
     */
    public final void translate(final CharSequence input, final Writer out) throws IOException {
        if (out == null) {
            throw new IllegalArgumentException("The Writer must not be null");
        }
        if (input == null) {
            return;
        }
        int pos = 0;
        final int len = input.length();
        while (pos < len) {
            // Starting at pos, translate the characters at that position and return the
            // number of code points consumed. Note: code points, not chars or code units.
            // This method is abstract in CharSequenceTranslator and must be implemented by
            // each subclass; by contract each subclass has to handle surrogate pairs.
            // For background on surrogate pairs, see my other post on what "length"
            // means for Java char and String.
            final int consumed = translate(input, pos, out);
            if (consumed == 0) { // no translator had anything to escape at this position
                // inlined implementation of Character.toChars(Character.codePointAt(input, pos))
                // avoids allocating temp char arrays and duplicate checks
                char c1 = input.charAt(pos);
                out.write(c1);
                pos++;
                // If the current position holds a surrogate pair, write both halves
                // of the supplementary character together.
                if (Character.isHighSurrogate(c1) && pos < len) {
                    char c2 = input.charAt(pos);
                    if (Character.isLowSurrogate(c2)) {
                        out.write(c2);
                        pos++;
                    }
                }
                continue;
            }
            // contract with translators is that they have to understand codepoints
            // and they just took care of a surrogate pair
            // consumed is meant to be a number of code points, so look up how many code
            // units the code point at pos occupies and move pos to the next code point.
            for (int pt = 0; pt < consumed; pt++) {
                pos += Character.charCount(Character.codePointAt(input, pos));
            }
        }
    }

That method in turn calls:

	/**
	* Translate a set of codepoints, represented by an int index into a CharSequence, 
	* into another set of codepoints. The number of codepoints consumed must be returned, 
	* and the only IOExceptions thrown must be from interacting with the Writer so that 
	* the top level API may reliably ignore StringWriter IOExceptions. 
	*
	* @param input CharSequence that is being translated
	* @param index int representing the current point of translation
	* @param out Writer to translate the text to
	* @return int count of codepoints consumed
	* @throws IOException if and only if the Writer produces an IOException
	*/
	public abstract int translate(CharSequence input, int index, Writer out) throws IOException;

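This is an abstract method that every subclass must implement. The translate method of AggregateTranslator simply calls the translate methods of the other Translators aggregated in it.

The translate method of AggregateTranslator is as follows: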

	/**
	* The first translator to consume codepoints from the input is the 'winner'.
	* Execution stops with the number of consumed codepoints being returned.
	* @inheritDoc
	*/
	@Override
	public int translate(final CharSequence input, final int index, final Writer out) throws IOException {
		for (final CharSequenceTranslator translator : translators) {
		    final int consumed = translator.translate(input, index, out);
		    if (consumed != 0) {
		        return consumed;
		    }
		}
		return 0;
	}
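The way these pieces fit together (an abstract per-position method, an aggregate where the first consumer wins, and a driver loop) can be reproduced in a few lines without commons-lang. The following is only a sketch of the pattern: Step, MiniTranslators and XML_BASIC are invented names, and it writes to a StringBuilder instead of a Writer to keep it short.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical miniature of the translator design; not commons-lang code.
final class MiniTranslators {

    // Plays the role of the abstract translate(input, index, out) method.
    interface Step {
        // Returns the number of code points consumed at index, or 0 for "not mine".
        int translate(CharSequence input, int index, StringBuilder out);
    }

    // A step that replaces one specific char with a fixed string.
    static Step replace(char from, String to) {
        return (input, index, out) -> {
            if (input.charAt(index) != from) {
                return 0;
            }
            out.append(to);
            return 1;
        };
    }

    // The first step that consumes something wins, as in AggregateTranslator.
    static Step aggregate(Step... steps) {
        List<Step> list = Arrays.asList(steps);
        return (input, index, out) -> {
            for (Step step : list) {
                int consumed = step.translate(input, index, out);
                if (consumed != 0) {
                    return consumed;
                }
            }
            return 0;
        };
    }

    // Driver loop modelled on the translate(CharSequence, Writer) listing.
    static String run(Step step, CharSequence input) {
        StringBuilder out = new StringBuilder(input.length() * 2);
        int pos = 0;
        while (pos < input.length()) {
            int consumed = step.translate(input, pos, out);
            if (consumed == 0) {
                // Nothing to translate here: copy one whole code point.
                int cp = Character.codePointAt(input, pos);
                out.appendCodePoint(cp);
                pos += Character.charCount(cp);
                continue;
            }
            for (int pt = 0; pt < consumed; pt++) {
                pos += Character.charCount(Character.codePointAt(input, pos));
            }
        }
        return out.toString();
    }

    // An aggregate that escapes only the five XML special characters.
    static final Step XML_BASIC = aggregate(
            replace('&', "&amp;"), replace('<', "&lt;"), replace('>', "&gt;"),
            replace('"', "&quot;"), replace('\'', "&apos;"));
}
```

MiniTranslators.run(MiniTranslators.XML_BASIC, text) then escapes the five special characters and passes everything else, Chinese characters included, straight through.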

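Next, let us look at the implementation of LookupTranslator, which is used quite frequently.
Its constructor iterates over the character mapping table passed in and converts the two-dimensional array into a map stored in lookupMap, which makes later lookups easy. It stores the prefix (first character) of each key in prefixSet, and records the lengths of the longest and shortest keys in the longest and shortest fields.
Its implementation of translate is as follows: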

    @Override
    public int translate(final CharSequence input, final int index, final Writer out) throws IOException {
        // Compare starting at position index of input; return as soon as a match is found.
        // check if translation exists for the input at position index
        if (prefixSet.contains(input.charAt(index))) {
            int max = longest;
            if (index + longest > input.length()) {
                max = input.length() - index;
            }
            // Try the longest candidate string first.
            // implement greedy algorithm by trying maximum match first
            for (int i = max; i >= shortest; i--) {
                final CharSequence subSeq = input.subSequence(index, index + i);
                final String result = lookupMap.get(subSeq.toString());

                if (result != null) {
                    out.write(result);
                    return i;
                }
            }
        }
        return 0;
    }
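The constructor described before the listing is not shown in this post. Going only by that description, it might be sketched as below; the class name LookupTableSketch is hypothetical and the field names are simply taken from the translate listing, so treat it as an approximation rather than the library source.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical reconstruction from the prose description; not the library source.
final class LookupTableSketch {
    final Map<String, String> lookupMap = new HashMap<>();
    final Set<Character> prefixSet = new HashSet<>();
    final int shortest;
    final int longest;

    // Each row of lookup is a {key, replacement} pair.
    LookupTableSketch(String[][] lookup) {
        int min = Integer.MAX_VALUE;
        int max = 0;
        for (String[] pair : lookup) {
            String key = pair[0];
            lookupMap.put(key, pair[1]);
            prefixSet.add(key.charAt(0));       // first char, used as a quick pre-check
            min = Math.min(min, key.length());  // lengths are counted in chars
            max = Math.max(max, key.length());
        }
        shortest = min;
        longest = max;
    }
}
```

The prefix set lets translate reject most positions with a single lookup, and shortest and longest bound the substring lengths that the greedy loop has to try.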

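That is the whole implementation. However, I think this method has a problem: it returns a length in chars rather than a number of code points. If a key in the lookup table contains a supplementary character, then this part of CharSequenceTranslator's translate method: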

	// contract with translators is that they have to understand codepoints
	// and they just took care of a surrogate pair
	for (int pt = 0; pt < consumed; pt++) {
		pos += Character.charCount(Character.codePointAt(input, pos));
	}

will most likely misbehave. This is something to watch out for.
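The arithmetic behind this concern can be checked with the JDK alone. The sketch below copies just the advance loop into a made-up class, ConsumedUnitsDemo, and feeds it a key consisting of one supplementary character (U+1F600, picked arbitrarily), once with the key's length in chars and once with its length in code points. It does not call commons-lang, so it illustrates the mismatch in units rather than proving a bug in any particular library version.

```java
// Hypothetical demo class; only the loop body is taken from the listing above.
final class ConsumedUnitsDemo {

    // Same advance loop as the driver: interprets consumed as a code point count.
    static int advance(CharSequence input, int pos, int consumed) {
        for (int pt = 0; pt < consumed; pt++) {
            pos += Character.charCount(Character.codePointAt(input, pos));
        }
        return pos;
    }

    public static void main(String[] args) {
        String key = "\uD83D\uDE00";   // one code point stored as two chars
        String input = key + "AB";

        int lengthInChars = key.length();
        int lengthInCodePoints = key.codePointCount(0, key.length());

        // Position reached when the translator reports the char length.
        System.out.println(advance(input, 0, lengthInChars));
        // Position reached when the translator reports the code point count.
        System.out.println(advance(input, 0, lengthInCodePoints));
    }
}
```

With the char length, the loop steps over the supplementary character and then over one more code point, so the 'A' that follows the key would never be written. With the code point count, it stops right after the key.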
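That covers how the escapeXml family of methods works. In essence, each one creates a CharSequenceTranslator and calls its translate method to do the escaping. We can also assemble our own CharSequenceTranslator to suit our needs instead of calling the predefined escapeXml methods.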
