How to detect and replace 4-byte UTF-8 encoded characters in Java (a range that includes emoji)
Posted by 尼克同学的博客
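> Reference articles
3. 【异常处理】Incorrect string value: '\xF0\x90\x8D\x83...' for column... Emoji表情字符过滤的Java实现 ("[Exception handling] Incorrect string value: '\xF0\x90\x8D\x83...' for column... A Java implementation of emoji character filtering")
4. Why a surrogate java regexp finds hypen-minus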
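> How to detect and replace 4-byte UTF-8 encoded characters (a range that includes emoji)
Our project needs to save information submitted from an H5 page on mobile devices.
Mobile input methods usually ship with built-in emoticons, and emoji are especially popular. Some emoji are encoded as 4 bytes in UTF-8, but with our current MySQL version and character-set settings the database can only store UTF-8 sequences of up to 3 bytes (storing 4-byte sequences would require upgrading MySQL and switching to a different character set). For background, see the article 《十分钟搞清字符集和字符编码》 ("Understanding character sets and character encodings in ten minutes").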
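To make the 3-byte versus 4-byte distinction concrete, here is a minimal sketch that prints how many bytes a few characters occupy in UTF-8. The class name `Utf8ByteCount` and the sample characters are chosen for illustration only; the emoji is built from its code point (U+1F600) so the example does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

public class Utf8ByteCount {

    /** Number of bytes the string occupies when encoded as UTF-8. */
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "a";                                     // U+0061
        String cjk = "\u6211";                                  // U+6211, the character 我
        String emoji = new String(Character.toChars(0x1F600));  // U+1F600, outside the BMP

        System.out.println(utf8Length(ascii));  // 1
        System.out.println(utf8Length(cjk));    // 3
        System.out.println(utf8Length(emoji));  // 4
        // A code point above U+FFFF is stored as two Java chars (a surrogate pair).
        System.out.println(emoji.length());     // 2
    }
}
```

Characters up to U+FFFF need at most 3 bytes in UTF-8, while anything above U+FFFF needs 4, which is why only part of the emoji repertoire triggers the storage problem.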
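We do not need to support emoji. When one is entered, it is enough to show a prompt or filter it out.
After reading "[Exception handling] Incorrect string value: '\xF0\x90\x8D\x83...' for column... A Java implementation of emoji character filtering" and "Why a surrogate java regexp finds hypen-minus", we learned that the replacement can be done with the following code: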
    // replaceAll returns a new string, so the result must be assigned
    msg = msg.replaceAll("[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]", "");
It works as expected.
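One possible reading of that expression, offered as an assumption rather than a verified explanation: Java's regex engine treats the consecutive escapes `\ud800\udc00` and `\udbff\udfff` as the single code points U+10000 and U+10FFFF, so the character class covers every supplementary code point, plus any lone surrogate in `\ud800-\udfff`. Under that reading, the same range can be written more directly with the `\x{...}` code-point syntax. The sketch below uses a hypothetical class `SupplementaryFilter`; unlike the original expression, it does not strip lone surrogates.

```java
import java.util.regex.Pattern;

public class SupplementaryFilter {

    // \x{h...h} is java.util.regex syntax for a code point given in hex.
    // U+10000..U+10FFFF are exactly the code points that need 4 bytes in UTF-8.
    private static final Pattern SUPPLEMENTARY =
            Pattern.compile("[\\x{10000}-\\x{10FFFF}]");

    /** True if the string contains at least one code point above U+FFFF. */
    public static boolean containsSupplementary(String s) {
        return s != null && s.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    /** Returns the string with every code point above U+FFFF removed. */
    public static String removeSupplementary(String s) {
        return s == null ? null : SUPPLEMENTARY.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F600)); // U+1F600
        String input = "ok" + emoji + "!";

        System.out.println(containsSupplementary(input)); // true
        System.out.println(removeSupplementary(input));   // ok!
    }
}
```

`containsSupplementary` fits the "show a prompt" case and `removeSupplementary` fits the "filter" case; both work on code points, so no byte-level handling is needed.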
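However, I was never able to fully understand how the hexadecimal regular expression above works, so I wrote my own code to detect and replace 4-byte UTF-8 characters. (It has not been thoroughly tested and is only meant to illustrate the general idea.)
The UTF-8 encoding rules come from 《十分钟搞清字符集和字符编码》, and the conversion between bytes and hexadecimal is based on 《Java中byte与16进制字符串的互相转换》 ("Converting between byte and hexadecimal strings in Java").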
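The UTF-8 rule the code relies on is that the leading bits of the first byte announce how long the character is: `0xxxxxxx` means 1 byte, `110xxxxx` 2 bytes, `1110xxxx` 3 bytes and `11110xxx` 4 bytes, while `10xxxxxx` marks a continuation byte. As a rough sketch of that rule using bit masks instead of hex strings (the class and method names here are hypothetical, not part of the original code):

```java
import java.nio.charset.StandardCharsets;

public class Utf8LeadByte {

    /**
     * Length (1-4) of the UTF-8 sequence announced by a lead byte,
     * or -1 if the byte is a continuation byte or not a valid lead byte.
     */
    public static int sequenceLength(byte b) {
        int v = b & 0xFF;
        if ((v & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((v & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((v & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((v & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;                        // 10xxxxxx or invalid
    }

    /** True if the UTF-8 encoding of s contains a 4-byte sequence. */
    public static boolean containsFourByteSequence(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int i = 0;
        while (i < bytes.length) {
            int len = sequenceLength(bytes[i]);
            if (len == 4) {
                return true;
            }
            if (len < 0) {
                throw new IllegalStateException("unexpected byte at index " + i);
            }
            i += len; // jump to the next lead byte
        }
        return false;
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F600)); // U+1F600
        System.out.println(containsFourByteSequence("a\u6211"));         // false
        System.out.println(containsFourByteSequence("a\u6211" + emoji)); // true
    }
}
```

The first hex digit checked in the hex-string approach corresponds to the top four bits tested here, so `f` as the first digit and the `11110xxx` mask express the same condition for valid UTF-8.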
    package com.nicchagil.tc.emojifilter;

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    public class UTF8Utils {

        public static void main(String[] args) {
            String s = "琥珀蜜蜡由于硬度很低,打磨起来非123常简单,需要的工具也非常简单,自己去买蜜蜡是非常不划算的,完全可以自己磨蜜蜡原石的样子";
            System.out.println(UTF8Utils.bytesToHex(s.getBytes(StandardCharsets.UTF_8)));
            System.out.println(UTF8Utils.bytesToHex(UTF8Utils.remove4BytesUTF8Char(s)));
        }

        /** First hex digit of a lead byte -> number of hex digits in the whole character. */
        public static Map<String, Integer> hexMap = new HashMap<String, Integer>();
        /** First hex digit of a lead byte -> number of bytes in the whole character. */
        public static Map<String, Integer> byteMap = new HashMap<String, Integer>();

        static {
            // 0-7: 1 byte; c-d: 2 bytes; e: 3 bytes; f: 4 bytes.
            // 8-b only occur as continuation bytes, so they have no mapping.
            for (int d = 0; d <= 7; d++) {
                hexMap.put(String.valueOf(d), 2);
                byteMap.put(String.valueOf(d), 1);
            }
            hexMap.put("c", 4);
            hexMap.put("d", 4);
            hexMap.put("e", 6);
            hexMap.put("f", 8);
            byteMap.put("c", 2);
            byteMap.put("d", 2);
            byteMap.put("e", 3);
            byteMap.put("f", 4);
        }

        /**
         * Whether the string contains a character encoded as 4 bytes in UTF-8
         * (converts to hex first, then checks).
         * @param s the string
         * @return true if a 4-byte UTF-8 character is present
         */
        public static boolean contains4BytesChar(String s) {
            if (s == null || s.trim().length() == 0) {
                return false;
            }

            // Encode explicitly as UTF-8 rather than with the platform default charset
            String hex = UTF8Utils.bytesToHex(s.getBytes(StandardCharsets.UTF_8));
            System.out.println("full hex : " + hex);
            String firstChar = null;
            while (hex != null && hex.length() > 1) {
                firstChar = hex.substring(0, 1);
                System.out.println("firstChar : " + firstChar);

                if ("f".equals(firstChar)) {
                    System.out.println("it is f start, it is 4 bytes, return.");
                    return true;
                }

                if (hexMap.get(firstChar) == null) {
                    System.out.println("no mapping for the first hex digit, default return false.");
                    // todo, throw exception for this case
                    return false;
                }

                // Skip the hex digits of the current character
                hex = hex.substring(hexMap.get(firstChar), hex.length());
                System.out.println("remain hex : " + hex);
            }

            return false;
        }

        /**
         * Whether the string contains a character encoded as 4 bytes in UTF-8
         * (walks the byte array directly).
         * @param s the string
         * @return true if a 4-byte UTF-8 character is present
         */
        public static boolean contains4BytesChar2(String s) {
            if (s == null || s.trim().length() == 0) {
                return false;
            }

            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            if (bytes == null || bytes.length == 0) {
                return false;
            }

            int index = 0;
            byte b;
            String hex = null;
            String firstChar = null;
            int step;
            while (index <= bytes.length - 1) {
                System.out.println("while loop, index : " + index);
                b = bytes[index];

                hex = byteToHex(b);
                if (hex == null || hex.length() < 2) {
                    System.out.println("fail to check whether contains 4 bytes char(1 byte hex char too short), default return false.");
                    // todo, throw exception for this case
                    return false;
                }

                firstChar = hex.substring(0, 1);
                if (firstChar.equals("f")) {
                    return true;
                }

                if (byteMap.get(firstChar) == null) {
                    System.out.println("fail to check whether contains 4 bytes char(no firstchar mapping), default return false.");
                    // todo, throw exception for this case
                    return false;
                }

                // Jump to the lead byte of the next character
                step = byteMap.get(firstChar);
                System.out.println("while loop, index : " + index + ", step : " + step);
                index = index + step;
            }

            return false;
        }

        /**
         * Removes characters encoded as 4 bytes in UTF-8.
         * @param s the string
         * @return the UTF-8 bytes with all 4-byte characters removed
         */
        public static byte[] remove4BytesUTF8Char(String s) {
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            byte[] removedBytes = new byte[bytes.length];
            int index = 0;

            String hex = null;
            String firstChar = null;
            for (int i = 0; i < bytes.length; ) {
                hex = UTF8Utils.byteToHex(bytes[i]);
                if (hex == null || hex.length() < 2) {
                    System.out.println("fail to remove 4 bytes char(1 byte hex char too short), default return null.");
                    // todo, throw exception for this case
                    return null;
                }

                firstChar = hex.substring(0, 1);
                if (byteMap.get(firstChar) == null) {
                    System.out.println("fail to remove 4 bytes char(no firstchar mapping), default return null.");
                    // todo, throw exception for this case
                    return null;
                }

                if (firstChar.equals("f")) {
                    // Skip all bytes of the 4-byte character
                    i += byteMap.get(firstChar);
                    continue;
                }

                // Copy all bytes of the current character
                for (int j = 0; j < byteMap.get(firstChar); j++) {
                    removedBytes[index++] = bytes[i++];
                }
            }

            return Arrays.copyOfRange(removedBytes, 0, index);
        }

        /**
         * Converts the string to hex and separates the hex digits character by
         * character so that the result is easier to read.
         * @param s the string
         */
        public static String splitForReading(String s) {
            if (s == null || s.trim().length() == 0) {
                return "";
            }

            String hex = UTF8Utils.bytesToHex(s.getBytes(StandardCharsets.UTF_8));
            System.out.println("full hex : " + hex);
            if (hex == null || hex.length() == 0) {
                System.out.println("fail to translate the bytes to hex.");
                // todo, throw exception for this case
                return "";
            }

            StringBuilder sb = new StringBuilder();
            int index = 0;
            String firstChar = null;
            String splittedString = null;
            while (index < hex.length()) {
                firstChar = hex.substring(index, index + 1);
                if (hexMap.get(firstChar) == null) {
                    System.out.println("fail to split the hex(no firstchar mapping), default return empty string.");
                    // todo, throw exception for this case
                    return "";
                }

                splittedString = hex.substring(index, index + hexMap.get(firstChar));
                sb.append(splittedString).append(" ");
                index = index + hexMap.get(firstChar);
            }

            System.out.println("formated sb : " + sb);
            return sb.toString();
        }

        /**
         * Converts a byte array to a hexadecimal string.
         * @param bytes the byte array
         * @return the hexadecimal string
         */
        public static String bytesToHex(byte[] bytes) {
            if (bytes == null || bytes.length == 0) {
                return null;
            }

            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < bytes.length; i++) {
                int r = bytes[i] & 0xFF;
                String hexResult = Integer.toHexString(r);
                if (hexResult.length() < 2) {
                    sb.append(0); // pad with a leading 0
                }
                sb.append(hexResult);
            }

            return sb.toString();
        }

        /**
         * Converts a single byte to a hexadecimal string.
         * @param b the byte
         * @return the hexadecimal string
         */
        public static String byteToHex(byte b) {
            int r = b & 0xFF;
            String hexResult = Integer.toHexString(r);

            StringBuilder sb = new StringBuilder();
            if (hexResult.length() < 2) {
                sb.append(0); // pad with a leading 0
            }
            sb.append(hexResult);
            return sb.toString();
        }
    }
Next, let's take a quick look at the UTF-8 encoding of a few different characters:
    package com.nicchagil.tc.emojifilter;

    import java.nio.charset.StandardCharsets;

    public class UTF8HexTester {

        public static void main(String[] args) {
            String s = "1";
            System.out.println("the hex of “" + s + "” : " + UTF8Utils.bytesToHex(s.getBytes(StandardCharsets.UTF_8)));

            s = "a";
            System.out.println("the hex of “" + s + "” : " + UTF8Utils.bytesToHex(s.getBytes(StandardCharsets.UTF_8)));

            s = "我";
            System.out.println("the hex of “" + s + "” : " + UTF8Utils.bytesToHex(s.getBytes(StandardCharsets.UTF_8)));

            s = "我很帅";
            System.out.println("the hex of “" + s + "” : " + UTF8Utils.bytesToHex(s.getBytes(StandardCharsets.UTF_8)));
        }
    }
Output:
    the hex of “1” : 31
    the hex of “a” : 61
    the hex of “我” : e68891
    the hex of “我很帅” : e68891e5be88e5b885
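The output above contains only 1-byte and 3-byte characters. For comparison, here is a small sketch that prints the UTF-8 hex of each code point separately and includes a 4-byte character built from its code point (U+1F600). The class `CodePointHex` and its method are hypothetical helpers and do not depend on `UTF8Utils`.

```java
import java.nio.charset.StandardCharsets;
import java.util.StringJoiner;

public class CodePointHex {

    /** UTF-8 hex of every code point in s, separated by spaces. */
    public static String hexPerCodePoint(String s) {
        StringJoiner out = new StringJoiner(" ");
        s.codePoints().forEach(cp -> {
            // Encode one code point at a time so each group is one character
            byte[] bytes = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            StringBuilder hex = new StringBuilder();
            for (byte b : bytes) {
                hex.append(String.format("%02x", b & 0xFF));
            }
            out.add(hex);
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F600)); // U+1F600
        // "a", the character 我 (U+6211) and one emoji
        System.out.println(hexPerCodePoint("a\u6211" + emoji)); // 61 e68891 f09f9880
    }
}
```

The last group starts with `f`, which is the marker the detection code looks for.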
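> Setting up a test channel
Emoji are hard to type on a PC, and a phone is still the most convenient way to enter them, so let's build a simple web application to receive emoji.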
    <!DOCTYPE html>
    <html>
    <head>
    <meta charset="UTF-8">
    <title>Emoji</title>
    </head>
    <script type="text/javascript" src="https://code.jquery.com/jquery-1.12.3.min.js"></script>
    <body>
        <form id="myform" action="http://192.168.1.3:8080/emoji/EmojiFilterServlet" >
            Input parameter : <input type='text' name='msg' />
            <br/>
            <input type='button' value=' ajax submit ' onclick="save();" />
            <input type='submit' value=' form submit ' />
        </form>
    </body>
    <script type="text/javascript">
        function save() {
            // alert('start save...');
            var data = $('#myform').serialize();
            // alert(data);

            $.ajax({
                type : "POST",
                url : "http://192.168.1.3:8080/emoji/EmojiFilterServlet",
                data : data,
                success : function(d) {
                    alert(d);
                }
            });
        }
    </script>
    </html>
    package com.nicchagil.tc.emojifilter;

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    /**
     * Servlet implementation class EmojiFilterServlet
     */
    public class EmojiFilterServlet extends HttpServlet {

        private static final long serialVersionUID = 1L;

        /**
         * @see HttpServlet#HttpServlet()
         */
        public EmojiFilterServlet() {
            super();
        }

        /**
         * @see HttpServlet#doGet(HttpServletRequest request, HttpServletResponse response)
         */
        protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
            this.doPost(request, response);
        }

        /**
         * @see HttpServlet#doPost(HttpServletRequest request, HttpServletResponse response)
         */
        protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
            // Assumption: the page submits the form as UTF-8. Without this call the
            // container may decode the request body with its default charset.
            request.setCharacterEncoding(StandardCharsets.UTF_8.name());

            String msg = request.getParameter("msg");
            System.out.println("msg -> " + msg);
        }
    }