What is the difference between a string in source code and a string read from a file?

Posted 2021-05-05

I have a file named "dd.txt" on my disk. Its content is \u5730\u7406

Now, when I run this program

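// needs java.io.ByteArrayOutputStream, java.io.FileInputStream, java.io.IOException
public static void main(String[] args) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    // try-with-resources closes the stream even if reading fails
    try (FileInputStream fis = new FileInputStream("d:\\dd.txt")) {
        byte[] buffer = new byte[4096];
        int n;
        while ((n = fis.read(buffer)) != -1) {
            baos.write(buffer, 0, n); // write only the bytes actually read
        }
    }
    String s1 = "\u5730\u7406"; // the compiler turns the escapes into 地理
    String s2 = baos.toString("utf-8");
    System.out.println("s1:" + s1 + "\n" + "s2:" + s2);
}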

I get different results

s1:地理
s2:\u5730\u7406

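Can you tell me why? And how can I read the file and get the same Chinese text as s1?

Answer

When you write \u5730 in Java source code, the compiler interprets it as a single Unicode character (a Unicode escape). When you write it to a file, it is just six ordinary characters, because nothing there interprets it. Is there a reason you don't write 地理 to the file directly?

If you want to read a file that contains Unicode escapes, you have to parse the values yourself: strip the \u prefix and convert the four hex digits into a character. If you control how the file is created, it is much easier to write the actual Unicode text to the file in a suitable encoding (for example UTF-8). Under normal circumstances you will never encounter files containing these escaped Unicode literals.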
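To make the "one character versus six characters" point concrete, here is a minimal sketch. The class and method names are made up for illustration; the doubled backslash simulates the text as it would arrive from a file, where no compiler has processed the escapes.

```java
final class EscapeLengthDemo {
    // The compiler replaces each escape with one character before the program runs.
    static String fromSource() {
        return "\u5730\u7406";
    }

    // Doubling the backslash keeps the escapes as plain text, which is what
    // a file containing those twelve characters hands to the program.
    static String asStoredInFile() {
        return "\\u5730\\u7406";
    }

    public static void main(String[] args) {
        System.out.println(fromSource().length());     // 2
        System.out.println(asStoredInFile().length()); // 12
    }
}
```

The two strings look identical in an editor only if the backslashes are overlooked; at run time they have nothing in common.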
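If the file is under your control, storing the real characters as UTF-8 could look roughly like this. It is only a sketch: the class name, the helper `roundTrip`, and the temporary file standing in for dd.txt are assumptions made for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class Utf8RoundTrip {
    // Writes the text as UTF-8 bytes and reads it back with the same charset.
    static String roundTrip(Path file, String text) throws IOException {
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for dd.txt (assumption for this sketch).
        Path file = Files.createTempFile("dd", ".txt");
        try {
            String text = "\u5730\u7406";
            System.out.println(text.equals(roundTrip(file, text)));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

As long as the writer and the reader name the same charset, no escape parsing is needed at all.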

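Another answer

In your Java code, \uxxxx is interpreted as a Unicode escape, so it is displayed as a Chinese character. This happens only because the compiler is instructed to do so.

To get the same result, you have to do some parsing yourself: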

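// Split on the literal two-character marker: a backslash followed by u
String[] hexCodes = s2.split("\\\\u");
for (String hexCode : hexCodes) {
    if (hexCode.length() == 0)
        continue; // the piece before the first marker is empty
    int intValue = Integer.parseInt(hexCode, 16);
    System.out.print((char) intValue);
}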

(Note that this only works when every character is in Unicode escape form, e.g. \uxxxx)

Another answer

Try this:
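The caveat can be checked with a small sketch. The method `decodeBySplit` is a hypothetical wrapper around the same split-and-parse idea, and the mixed input string is invented for the example.

```java
final class SplitLimitation {
    // Same approach as the snippet above: split on the literal backslash-u
    // marker and parse every non-empty piece as a hexadecimal code unit.
    static String decodeBySplit(String s) {
        StringBuilder sb = new StringBuilder();
        for (String hexCode : s.split("\\\\u")) {
            if (hexCode.isEmpty()) {
                continue;
            }
            sb.append((char) Integer.parseInt(hexCode, 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeBySplit("\\u5730\\u7406").length()); // 2
        try {
            decodeBySplit("Geo: \\u5730"); // "Geo: " is not a hex number
        } catch (NumberFormatException e) {
            System.out.println("mixed input rejected");
        }
    }
}
```

Plain text that happens to consist of hex digits would be worse still, since it would be converted silently instead of failing.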

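// needs java.util.regex.Matcher and java.util.regex.Pattern
// Matches a literal backslash, the letter u, and exactly four hex digits
static final Pattern UNICODE_ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");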

static String decodeUnicodeEscape(String s) {
    StringBuilder sb = new StringBuilder();
    int start = 0;
    Matcher m = UNICODE_ESCAPE.matcher(s);
    while (m.find()) {
        sb.append(s.substring(start, m.start()));
        sb.append((char)Integer.parseInt(m.group(1), 16));
        start = m.end();
    }
    sb.append(s.substring(start));
    return sb.toString();
}

public static void main(String[] args) throws IOException {
    // your code ....
    String s1 = "\u5730\u7406";
    String s2 = decodeUnicodeEscape(baos.toString("utf-8"));
    System.out.println("s1:" + s1 + "\n" + "s2:" + s2);
}
