Unicode in the Eyes of a C Program

Posted 2021-08-30 garfileo

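Last year I wrote an article, "Handling UTF-8 Strings in C Programs", which showed how to do basic UTF-8 text processing with the UTF-8 string functions that GLib provides. GLib, however, is a fairly comprehensive C library, and string handling is only one small module of it. libunistring, part of the GNU project, is a C library that focuses on Unicode string processing. It is dual-licensed under the GPL and LGPL, and its size and feature set may be a better fit for C projects that need to process UTF-8/UTF-16/UTF-32 text (C++ projects should look at ICU instead). This article only covers how to use it to work with UTF-8 text, and even that not exhaustively.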

A refresher on C strings

A C string can be viewed as an array of bytes, because the char type is 1 byte long. Take the following string:

char *s = "葫芦娃";

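If the text stored in the string is UTF-8 encoded, strlen(s) returns 9, because the three Chinese characters 葫芦娃 take 9 bytes in UTF-8.

So does s occupy 9 bytes? No, it occupies 10. In C, a string literal ends with a NUL character (\0) that the compiler appends automatically. Written explicitly as a byte array, the code above would be: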

char s[10] = "葫芦娃";

or

char s[] = "葫芦娃";

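Remember: when you see char, think byte first, not character, unless "character" means specifically an ASCII character.

In other words, a C string is just a sequence of bytes. What those bytes mean depends on the character encoding in use.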
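To make the byte view concrete, here is a minimal sketch in plain C that writes out each byte of the string in hexadecimal. The helper name hex_bytes is made up for this illustration, and the literal is spelled with \x escapes so that the result does not depend on how the source file happens to be saved.

```c
#include <stdio.h>
#include <string.h>

/* "葫芦娃" spelled out as UTF-8 bytes, independent of the source file encoding */
static const char calabash[] = "\xE8\x91\xAB\xE8\x8A\xA6\xE5\xA8\x83";

/* Hypothetical helper: write the bytes of s into out as space-separated hex.
   out must have room for 3 * strlen(s) + 1 bytes. Returns the byte count. */
size_t hex_bytes(const char *s, char *out)
{
        size_t n = strlen(s);
        out[0] = '\0';
        for (size_t i = 0; i < n; i++) {
                /* cast to unsigned char because plain char may be signed */
                sprintf(out + 3 * i, "%02X ", (unsigned char)s[i]);
        }
        if (n > 0) out[3 * n - 1] = '\0'; /* drop the trailing space */
        return n;
}
```

Calling hex_bytes(calabash, buf) returns 9 and fills buf with E8 91 AB E8 8A A6 E5 A8 83, three bytes per character, while sizeof calabash is 10 because of the terminating NUL.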

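Unicode and UTF-8

Unicode provides a unified encoding for the scripts of the vast majority of the world's countries, regions, and peoples.

UTF-8 is a variable-length encoding, and anything Unicode can express, UTF-8 can express too. Think of Unicode as a protocol and UTF-8 as one implementation of it, or of Unicode as the substance and UTF-8 as the practice. Either way, Unicode code points and their UTF-8 encodings are in one-to-one correspondence. Applications commonly use Unicode code points as the internal representation and UTF-8 as the text encoding for input and output.

Why not use Unicode code points directly? Because a code point stored at fixed width (UTF-32/UCS-4) takes 4 bytes per character, which wastes resources for both file storage and network transmission. Even if we did not mind, Westerners accustomed to ASCII would.

Besides UTF-8, other encodings also map one-to-one onto Unicode, such as UTF-16 and UTF-32/UCS-4 (the older UCS-2 covers only the Basic Multilingual Plane). Westerners prefer UTF-8 because it is compatible with ASCII: even if the whole world used UTF-8, their daily life would be unaffected, and they would gain the chance to see Eastern scripts. For us, UTF-8 is somewhat verbose. Most Chinese characters take 3 bytes in UTF-8 and some rare ones take 4, whereas under the national standards GB2312 and GB18030 promoted by the Chinese government, common Chinese characters take only 2. When networks were slow, a 2-byte encoding saved some bandwidth, but now that everyone streams video with abandon, 3 bytes per character hardly matters.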
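The variable-length scheme is easy to see in code. The following is a minimal sketch of a UTF-8 encoder in plain C; the function name utf8_encode is my own and is not part of libunistring.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8.
   Writes up to 4 bytes to out and returns how many were written,
   or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
        if (cp < 0x80) {                        /* ASCII is stored unchanged */
                out[0] = (unsigned char)cp;
                return 1;
        }
        if (cp < 0x800) {                       /* 110xxxxx 10xxxxxx */
                out[0] = (unsigned char)(0xC0 | (cp >> 6));
                out[1] = (unsigned char)(0x80 | (cp & 0x3F));
                return 2;
        }
        if (cp < 0x10000) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
                if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
                out[0] = (unsigned char)(0xE0 | (cp >> 12));
                out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
                out[2] = (unsigned char)(0x80 | (cp & 0x3F));
                return 3;
        }
        if (cp <= 0x10FFFF) {                   /* 11110xxx followed by three 10xxxxxx */
                out[0] = (unsigned char)(0xF0 | (cp >> 18));
                out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
                out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
                out[3] = (unsigned char)(0x80 | (cp & 0x3F));
                return 4;
        }
        return 0;
}
```

With this sketch, the letter A (U+0041) encodes to 1 byte, 葫 (U+846B) to the 3 bytes E8 91 AB, and a rare character such as U+20000 to 4 bytes.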

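On Linux there is not much to think about: UTF-8 is simply the politically correct choice, and most distributions default to it as the system encoding. I have used UTF-8 for more than 10 years without any physical or psychological discomfort, apart from the occasional zip or rar archive from Windows whose file names turn into a string of question marks after extraction.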

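libunistring's UTF-8 string functions

In libunistring, functions whose parameters or return values have the type uint8_t *, or whose names begin with u8_, deal with UTF-8 strings. More precisely, uint8_t * is the UTF-8 string type, but the libunistring documentation groups it with uint16_t * and uint32_t * under the name "Unicode strings".

For example: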

int u8_strmblen (const uint8_t *s)

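The libunistring documentation says this function returns the length (in bytes) of the first character of the Unicode string s.

When reading the documentation, keep in mind that s is really just a C string (char *), that is, a NUL-terminated byte array. Functions whose names begin with u8_str all operate on C strings. For strings that are not NUL-terminated, libunistring provides a separate set of functions that take the length of the string as an extra argument. For example: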

int u8_mblen (const uint8_t *s, size_t n)

Another concept to watch for in the documentation is the unit. Overlook it and you will misread what some functions do. u8_strlen tripped me up right away. Its prototype is:

size_t u8_strlen (const uint8_t *s)

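I called it like this:

printf("%zu\n", u8_strlen((const uint8_t *)"葫芦娃"));

I expected it to print 3, but it prints 9. The function counts not the characters in the string but the units. For UTF-8 a unit is 1 byte; for UTF-16 a unit is 2 bytes.

The documentation describes the function as follows: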

Function: size_t u8_strlen (const uint8_t *s)

Returns the number of units in s.

This function is similar to strlen and wcslen, except that it operates on Unicode strings.

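In fact, libunistring has no u8_str-style function that directly counts the characters in a NUL-terminated UTF-8 string (u8_mbsnlen does count characters, but it needs the length of the string in units). Writing one on top of the basic functions it provides is not hard, though.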
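The difference between units and characters can be worked out by hand. The sketch below, in plain C with function names of my own choosing, computes how many units a sequence of code points needs in UTF-8 and in UTF-16. It assumes the code points are valid.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of 1-byte units UTF-8 needs for code point cp */
size_t utf8_units(uint32_t cp)
{
        if (cp < 0x80) return 1;
        if (cp < 0x800) return 2;
        if (cp < 0x10000) return 3;
        return 4;
}

/* Number of 2-byte units UTF-16 needs for code point cp */
size_t utf16_units(uint32_t cp)
{
        return cp < 0x10000 ? 1 : 2;    /* above the BMP: a surrogate pair */
}

/* Sum the units needed for count code points under the given encoding */
size_t total_units(const uint32_t *cps, size_t count, size_t (*units)(uint32_t))
{
        size_t total = 0;
        for (size_t i = 0; i < count; i++) total += units(cps[i]);
        return total;
}
```

For 葫芦娃, whose code points are U+846B, U+82A6 and U+5A03, total_units gives 9 with utf8_units and 3 with utf16_units. The string has 3 characters in both cases; only the unit count changes with the encoding.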

Counting UTF-8 characters

To traverse a UTF-8 string, libunistring provides the u8_next and u8_prev functions. With them you can count UTF-8 characters one at a time:

/* foo.c */

#include <stdio.h>
#include <unistr.h>

int main(void)
{
        const uint8_t *s = (uint8_t *)"葫芦娃";
        size_t n = 0;
        ucs4_t c;
        for (const uint8_t *it = s; it; it = u8_next(&c, it)) n++;
        n--; /* do not count the terminating NUL character */
        printf("%zu\n", n);
        return 0;
}

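unistr.h is a libunistring header that declares a set of basic string processing functions.

Note the variable ucs4_t c in the code. u8_next does more than step over a UTF-8 character: along the way it also decodes the character into its Unicode code point and stores it in c.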
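To show what decoding a UTF-8 character into a code point involves, here is a rough sketch in plain C. It is not libunistring's implementation, and the name utf8_decode and its return convention are my own.

```c
#include <stdint.h>

/* Decode the UTF-8 character at the start of the NUL-terminated string s.
   Stores the code point in *cp and returns the number of bytes consumed,
   0 at the terminating NUL, or -1 if the bytes are not well-formed UTF-8. */
int utf8_decode(const unsigned char *s, uint32_t *cp)
{
        /* smallest code point that legitimately needs 2, 3 or 4 bytes */
        static const uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
        uint32_t c = s[0];
        int len;

        if (c == 0) { *cp = 0; return 0; }
        if (c < 0x80) { *cp = c; return 1; }
        if ((c & 0xE0) == 0xC0) { len = 2; c &= 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; c &= 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; c &= 0x07; }
        else return -1;         /* stray continuation byte or invalid lead byte */

        for (int i = 1; i < len; i++) {
                /* a NUL fails this test too, so we never read past the end */
                if ((s[i] & 0xC0) != 0x80) return -1;
                c = (c << 6) | (s[i] & 0x3F);
        }
        if (c < min_cp[len]) return -1;                         /* overlong form */
        if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return -1;
        *cp = c;
        return len;
}
```

Given the bytes E8 91 AB, this sketch returns 3 and stores 0x846B, the code point of 葫.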

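Compile the program with gcc and run it. The library goes after the source file, because some linkers resolve symbols in command-line order:

$ gcc -std=c11 -pedantic foo.c -lunistring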
$ ./a.out

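The output should be 3.

Note: on my Gentoo Linux system, libunistring's .h files are in /usr/include and libunistring.so is in /usr/lib. Other Linux distributions may differ, and you may need to set the paths to the .h and .so files by hand.

Since counting UTF-8 characters is a common need, it is worth wrapping in a function: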

size_t n_of_utf8_chars(const uint8_t *s)
{
        size_t n = 0;
        ucs4_t c;
        for (const uint8_t *it = s; it;it = u8_next(&c, it)) n++;
        n--; /* do not count the terminating NUL character */
        return n;
}
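For comparison, the same count can be obtained without libunistring. The sketch below is an alternative technique, not the method used above: it assumes the input is already valid UTF-8 and counts every byte that is not a continuation byte. The name utf8_count is made up.

```c
#include <stddef.h>

/* Count the characters (code points) in a NUL-terminated string that is
   assumed to be valid UTF-8. Continuation bytes have the form 10xxxxxx;
   every other byte starts a new character. */
size_t utf8_count(const char *s)
{
        size_t n = 0;
        for (; *s; s++) {
                if (((unsigned char)*s & 0xC0) != 0x80) n++;
        }
        return n;
}
```

Unlike the u8_next version, this one performs no validation at all, so on malformed input it returns a number that means little.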

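Local encoding conversion

The example in the previous section actually has a problem. I passed a C string (really a substring, from the position it points to up to the NUL character) directly to u8_next.

Because the text editor I use also takes its input as UTF-8,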

const char *s = "葫芦娃";

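is naturally a UTF-8 encoded string as well.

In a non-UTF-8 environment, s would not be UTF-8 encoded, and passing it to a function such as u8_next that expects UTF-8 would give wrong results. Put bluntly, to any C program every string is a byte array, whatever its encoding.

For libunistring's UTF-8 functions to work correctly, you must make sure that the strings you pass to them really are UTF-8.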
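One way to enforce that precondition is to validate the bytes first. libunistring has its own u8_check function for this; the sketch below is a plain C stand-in with a made-up name, utf8_is_valid, meant only to show what such a check has to look at.

```c
#include <stdbool.h>
#include <stddef.h>

/* Return true if the n bytes at s form well-formed UTF-8 */
bool utf8_is_valid(const unsigned char *s, size_t n)
{
        size_t i = 0;
        while (i < n) {
                unsigned char b = s[i];
                unsigned long cp;
                size_t len;

                if (b < 0x80) { i++; continue; }        /* ASCII */
                if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
                else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
                else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
                else return false;                      /* invalid lead byte */

                if (len > n - i) return false;          /* truncated character */
                for (size_t k = 1; k < len; k++) {
                        if ((s[i + k] & 0xC0) != 0x80) return false;
                        cp = (cp << 6) | (s[i + k] & 0x3F);
                }
                /* reject overlong forms, surrogates and out-of-range values */
                if ((len == 2 && cp < 0x80) || (len == 3 && cp < 0x800) ||
                    (len == 4 && cp < 0x10000)) return false;
                if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
                i += len;
        }
        return true;
}
```

The 9 UTF-8 bytes of 葫芦娃 pass this check, while the GB18030 bytes D6 D0 CE C4 (中文) are rejected, because D0 is not a continuation byte.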

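What if your text editor does not support UTF-8? As long as you know which encoding it uses, there is a way out: libunistring provides a set of functions that convert from various encodings to UTF-8/16/32.

For a NUL-terminated C string, the following function converts it to UTF-8: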

#include <uniconv.h>

uint8_t * u8_strconv_from_encoding (const char *string, const char *fromcode, enum iconv_ilseq_handler handler)

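The prototype looks complicated, but the function is simple. string is the C string (a byte array) to convert, and fromcode is the name of its encoding. The names are the same ones accepted by the Linux encoding conversion tool iconv, because libunistring relies on an implementation of the POSIX iconv function such as libiconv. As for handler, setting it to iconveh_question_mark is enough: any character that cannot be converted is replaced with ?.

For example, to convert a GB18030 string into a UTF-8 string in libunistring's sense: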

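const char *s = "葫芦娃"; /* assumes this source file is saved as GB18030 */
uint8_t *s_ = u8_strconv_from_encoding(s, "GB18030", iconveh_question_mark);
/* s_ is freshly allocated (NULL on failure); release it with free(s_) */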

In short, handling UTF-8 strings correctly with libunistring requires knowing the encoding of your C strings, since C strings are the source of everything libunistring works on.

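Splitting text into lines

Suppose a string in some encoding contains line break characters; it can then be regarded as multi-line text. Splitting such text into lines means finding the position of each line break in the string.

The line break we usually mean is \n, whose ASCII code is 0A in hexadecimal. If every string in a C program is ASCII, searching for line breaks is trivial: any byte with the value 0A is a line break. In non-ASCII encodings this may not work. A character is usually encoded as several bytes, and one of those bytes may have the value 0A without the character being a line break. UTF-8 deliberately avoids such clashes between multi-byte sequences and ASCII, but UTF-16/32 do not. In UTF-16, for instance, 《 is encoded as 0x300A, and the 0A byte there clearly must not be treated as a line break.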
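The clash is easy to reproduce. The sketch below, in plain C with names of my own, runs a naive byte search for 0A over the same two-character text, 《 followed by a real line break, encoded once as UTF-8 and once as big-endian UTF-16.

```c
#include <stddef.h>

/* Naive line break search: count the bytes equal to 0x0A in a buffer */
size_t count_0a_bytes(const unsigned char *buf, size_t n)
{
        size_t hits = 0;
        for (size_t i = 0; i < n; i++) {
                if (buf[i] == 0x0A) hits++;
        }
        return hits;
}

/* U+300A followed by U+000A, encoded as UTF-8 */
static const unsigned char text_utf8[] = {0xE3, 0x80, 0x8A, 0x0A};

/* The same two characters encoded as big-endian UTF-16 */
static const unsigned char text_utf16be[] = {0x30, 0x0A, 0x00, 0x0A};
```

count_0a_bytes reports 1 for text_utf8, which is correct, but 2 for text_utf16be: the second byte of 《 is a false positive.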

If you know the encoding of a C string, libunistring provides a set of functions for locating line breaks in it. For UTF-8 strings the function is:

#include <unilbrk.h>
void u8_possible_linebreaks (const uint8_t *s, size_t n, const char *encoding, char *p)

Next, consider reading every byte of a UTF-8 text file into a C array and then using u8_possible_linebreaks to find where the line breaks occur.

#include <stdio.h>
#include <stdlib.h>
#include <unilbrk.h>

int main(void)
{
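        /* Open in binary mode so the byte count matches what fread returns */
        FILE *fp = fopen("demo.txt", "rb");
        if (!fp) {
                perror("demo.txt");
                return EXIT_FAILURE;
        }

        /* Get the file size in bytes */
        fseek(fp, 0, SEEK_END);
        long size = ftell(fp);
        rewind(fp);
        if (size < 0) {
                fclose(fp);
                return EXIT_FAILURE;
        }
        size_t n = (size_t)size;

        /* Read the file contents into a byte array */
        uint8_t *s = malloc(n + 1);
        char *p = malloc(n + 1);
        if (!s || !p || fread(s, 1, n, fp) != n) {
                fclose(fp);
                free(p);
                free(s);
                return EXIT_FAILURE;
        }
        s[n] = '\0';
        fclose(fp);

        /* Find the line breaks */
        u8_possible_linebreaks(s, n, "UTF-8", p);
        for (size_t i = 0; i < n; i++) {
                if (p[i] == UC_BREAK_MANDATORY) printf("换行符\n");         /* line break */
                else if (p[i] == UC_BREAK_PROHIBITED) printf("不可拆\n");   /* must not be split */
                else printf("其他情况\n");                                  /* anything else */
        }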
        free(p);
        free(s);
        return 0;
}

Suppose demo.txt contains:

葫
芦
娃

Then the program should print the following (换行符 means "line break", 不可拆 means "must not be split"):

不可拆
不可拆
不可拆
换行符
不可拆
不可拆
不可拆
换行符
不可拆
不可拆
不可拆
换行符

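The output shows that the file contains 3 line breaks, and that the three bytes of each UTF-8 character were correctly recognized as inseparable.

A close look at the code shows that u8_possible_linebreaks needs an array p of the same length as the UTF-8 string s. It classifies every byte s[i] of the string: if no line break is allowed at s[i], as is the case for the bytes of the Chinese characters here, p[i] is set to UC_BREAK_PROHIBITED; if s[i] is a line break, p[i] is set to UC_BREAK_MANDATORY.

Note that the fseek/ftell snippet used to get the file length is not very efficient. On Linux, the stat function can do this instead. The advantage of fseek and ftell is portability.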
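As a sketch of the stat alternative, assuming a POSIX system: the wrapper name file_size and its -1 error convention are my own choices.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* ask for POSIX declarations under strict -std=c11 */
#endif
#include <sys/types.h>
#include <sys/stat.h>

/* Return the size in bytes of the file at path, or -1 on failure.
   stat reads the size from the file's metadata, so the file is never opened. */
long long file_size(const char *path)
{
        struct stat st;
        if (stat(path, &st) != 0) return -1;
        return (long long)st.st_size;
}
```

In the line-splitting program, n could then be obtained with file_size("demo.txt") before the file is opened, at the cost of no longer being portable to systems without stat.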

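Vacuum strings

Now consider this problem: how do you remove all whitespace characters from a UTF-8 string? I call a string with its whitespace removed a vacuum string.

Take a UTF-8 string: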

const uint8_t *s = (uint8_t *)"      葫  \t芦\n娃";

Its vacuum form should be

const uint8_t *t = (uint8_t *)"葫芦娃";

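We can imitate u8_possible_linebreaks and build an array p of the same length as s, where p[i] is 1 if s[i] is a non-whitespace character and 0 otherwise. Unlike u8_possible_linebreaks, though, s[i] and p[i] here refer to a whole UTF-8 character (several bytes), not a single byte.

First, allocate p according to the length of s. Remember n_of_utf8_chars from above?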

size_t  n = n_of_utf8_chars(s);
int *p = malloc(n * sizeof(int));

Then enumerate the UTF-8 characters in s with u8_next:

const uint8_t *it = s;
for (size_t i = 0; i < n; i++) {
        ucs4_t c;
        it = u8_next(&c, it);
        if (uc_is_property_white_space(c)) p[i] = 0;
        else p[i] = 1;
}

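uc_is_property_white_space is the libunistring function that tests whether a given Unicode code point has the White_Space property, that is, whether it is a blank such as a space, tab, or line break rather than a letter, digit, symbol, or punctuation mark. Its prototype is: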

#include <unictype.h>
bool uc_is_property_white_space (ucs4_t uc);

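libunistring has many functions like uc_is_property_white_space. For example, uc_is_property_unified_ideograph tests whether a code point is a Chinese, Japanese, Korean, or Vietnamese ideograph, and uc_is_property_punctuation tests whether it is a punctuation mark.

Here is the complete code: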

#include <stdio.h>
#include <stdlib.h>
#include <unistr.h>
#include <unictype.h>

size_t n_of_utf8_chars(const uint8_t *s)
{
        ucs4_t c;
        size_t n = 0;
        for (const uint8_t *it = s; it;it = u8_next(&c, it)) n++;
        n--; /* do not count the terminating NUL character */
        return n;
}

int main(void)
{
        const uint8_t *s = (uint8_t *)"      葫  \t芦\n娃";
        size_t  n = n_of_utf8_chars(s);
        int *p = malloc(n * sizeof(int));
        const uint8_t *it = s;
        for (size_t i = 0; i < n; i++) {
                ucs4_t c;
                it = u8_next(&c, it);
                if (uc_is_property_white_space(c)) p[i] = 0;
                else p[i] = 1;
        }
        for (size_t i = 0; i < n; i++) {
                printf("%d", p[i]);
        }
        printf("\n");
        free(p);
        return 0;
}

The program should print:

0000001000101

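Using the entries of p that are 1 to extract the non-whitespace characters from s is not difficult, and the libunistring functions covered in this article are enough for the job, so I will not go into it.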
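For readers who want to see that last step anyway, here is one possible sketch in plain C. It is my own illustration rather than the libunistring route: it takes the mask p as given, assumes s is valid UTF-8, and derives each character's length from its first byte. The names utf8_char_len and extract_marked are hypothetical.

```c
#include <stddef.h>

/* Length in bytes of the UTF-8 character whose first byte is b
   (valid UTF-8 assumed; anything unexpected is treated as 1 byte) */
static size_t utf8_char_len(unsigned char b)
{
        if ((b & 0xE0) == 0xC0) return 2;
        if ((b & 0xF0) == 0xE0) return 3;
        if ((b & 0xF8) == 0xF0) return 4;
        return 1;
}

/* Copy to out the characters of s whose mask entry p[i] is nonzero.
   n is the number of characters in s (not bytes), and out needs room for
   strlen(s) + 1 bytes. Returns the number of bytes written, NUL excluded. */
size_t extract_marked(const unsigned char *s, const int *p, size_t n,
                      unsigned char *out)
{
        size_t w = 0;
        for (size_t i = 0; i < n && *s; i++) {
                size_t len = utf8_char_len(*s);
                size_t k = 0;
                /* stop early at a NUL so a truncated character cannot overrun */
                while (k < len && s[k]) {
                        if (p[i]) out[w++] = s[k];
                        k++;
                }
                s += k;
        }
        out[w] = '\0';
        return w;
}
```

With s and p from the program above, extract_marked(s, p, n, out) leaves the 9 bytes of 葫芦娃 in out.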

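Counting words in text

Some blogging sites offer an article word count, but most of them really only count characters. For a Chinese article, a word count should report how many Chinese characters, Western words, punctuation marks, and numbers the document contains. The C program below does roughly that with a crude state machine.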

#include <stdio.h>
#include <stdlib.h>
#include <unistr.h>
#include <unictype.h>

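/* Read a whole file into a NUL-terminated byte array; returns NULL on failure */
static uint8_t *input_text(const char *file_name)
{
        FILE *fp = fopen(file_name, "rb");
        if (!fp) return NULL;
        fseek(fp, 0, SEEK_END);
        long size = ftell(fp);
        rewind(fp);
        uint8_t *s = size < 0 ? NULL : malloc((size_t)size + 1);
        if (s && fread(s, 1, (size_t)size, fp) != (size_t)size) {
                free(s);
                s = NULL;
        }
        if (s) s[size] = '\0';
        fclose(fp);
        return s;
}

int main(int argc, char **argv)
{
        if (argc < 2) {
                fprintf(stderr, "usage: %s FILE\n", argv[0]);
                return EXIT_FAILURE;
        }
        uint8_t *text = input_text(argv[1]);
        if (!text) {
                fprintf(stderr, "cannot read %s\n", argv[1]);
                return EXIT_FAILURE;
        }
        const uint8_t *it = text;
        /* k: punctuation marks, m: Western words, n: Han characters, s: numbers */
        size_t k = 0, m = 0, n = 0, s = 0;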
        enum {INIT, LATIN, CJK, WHITE, NUM, PUNCT} state = INIT;
        while (it) {
                ucs4_t c;
                it = u8_next(&c, it);
                if (!it) break;
                switch(state) {
                case INIT:
                        if (uc_is_property_white_space(c)) state = WHITE;
                        else if (uc_is_property_unified_ideograph(c)) {
                                n++;
                                state = CJK;
                        } else if (uc_is_property_punctuation(c)) {
                                k++;
                                state = PUNCT;
                        } else if (uc_is_property_alphabetic(c)) {
                                m++;
                                state = LATIN;
                        } else if (uc_is_property_numeric(c)) {
                                s++;
                                state = NUM;
                        }
                        break;
                case LATIN:
                        if (uc_is_property_white_space(c)) state = WHITE;
                        else if (uc_is_property_punctuation(c)) {
                                k++;
                                state = PUNCT;
                        } else if (uc_is_property_unified_ideograph(c)) {
                                n++;
                                state = CJK;
                        } else if (uc_is_property_numeric(c)) {
                                s++;
                                state = NUM;
                        }
                        break;
                case CJK:
                        if (uc_is_property_unified_ideograph(c)) n++;
                        else if (uc_is_property_white_space(c)) state = WHITE;
                        else if (uc_is_property_punctuation(c)) {
                                k++;
                                state = PUNCT;
                        } else if (uc_is_property_alphabetic(c)) {
                                m++;
                                state = LATIN;
                        } else if (uc_is_property_numeric(c)) {
                                s++;
                                state = NUM;
                        }
                        break;
                case WHITE:
                        if (uc_is_property_punctuation(c)) {
                                k++;
                                state = PUNCT;
                        } else if (uc_is_property_unified_ideograph(c)) {
                                n++;
                                state = CJK;
                        } else if (uc_is_property_alphabetic(c)) {
                                m++;
                                state = LATIN;
                        } else if (uc_is_property_numeric(c)) {
                                s++;
                                state = NUM;
                        }
                        break;
                case PUNCT:
                        if (uc_is_property_white_space(c)) state = WHITE;
                        else if (uc_is_property_punctuation(c)) k++;
                        else if (uc_is_property_unified_ideograph(c)) {
                                n++;
                                state = CJK;
                        } else if (uc_is_property_numeric(c)) {
                                s++;
                                state = NUM;
                        } else if (uc_is_property_alphabetic(c)) {
                                m++;
                                state = LATIN;
                        }
                        break;
                case NUM:
                        if (uc_is_property_white_space(c)) state = WHITE;
                        else if (uc_is_property_unified_ideograph(c)) {
                                n++;
                                state = CJK;
                        } else if (uc_is_property_alphabetic(c)) {
                                m++;
                                state = LATIN;
                        } else if (uc_is_property_punctuation(c)) {
                                k++;
                                state = PUNCT;
                        }
                        break;
                default:
                        break;
                }
        }
        /* u8_next returns NULL at the terminating NUL, so the loop exits before
           the NUL reaches the switch and no count needs to be corrected. */
        
        printf("Han characters: %zu\n", n);
        printf("Western words: %zu\n", m);
        printf("Punctuation marks: %zu\n", k);
        printf("Numbers: %zu\n", s);
        free(text);
        return 0;
}
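The six branches of that switch follow nearly the same rules: every Han character and every punctuation mark is counted, while a Western word or a number is counted only when it starts. The sketch below expresses that observation in plain C, working on an array of already decoded code points. It is a rough equivalent, not a drop-in replacement: the classify function is a hypothetical stand-in for the uc_is_property_* tests and only knows ASCII, the basic CJK Unified Ideographs block, and a few CJK punctuation marks.

```c
#include <stddef.h>
#include <stdint.h>

enum char_class { OTHER, WHITE, HAN, PUNCT, LATIN, NUM };

struct counts { size_t han, words, punct, numbers; };

/* Hypothetical classifier covering only a small part of Unicode */
static enum char_class classify(uint32_t c)
{
        if (c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == 0x3000)
                return WHITE;
        if (c >= 0x4E00 && c <= 0x9FFF) return HAN;
        if (c >= '0' && c <= '9') return NUM;
        if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')) return LATIN;
        if (c == ',' || c == '.' || c == '!' || c == '?' || c == ';' ||
            c == ':' || c == 0x3001 || c == 0x3002 || c == 0xFF0C)
                return PUNCT;
        return OTHER;
}

/* Count Han characters, Western words, punctuation marks and numbers */
struct counts count_text(const uint32_t *cps, size_t len)
{
        struct counts r = {0, 0, 0, 0};
        enum char_class state = OTHER;  /* plays the role of INIT */
        for (size_t i = 0; i < len; i++) {
                enum char_class cls = classify(cps[i]);
                if (cls == OTHER) continue;     /* unclassified: keep the state */
                if (cls == HAN) r.han++;        /* every Han character counts */
                else if (cls == PUNCT) r.punct++;       /* so does every mark */
                else if (cls == LATIN && state != LATIN) r.words++;
                else if (cls == NUM && state != NUM) r.numbers++;
                state = cls;
        }
        return r;
}
```

For the code points of "Hello, 世界 42 ok" this sketch reports 2 Han characters, 2 Western words, 1 punctuation mark, and 1 number.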

Going further

See the libunistring documentation: https://www.gnu.org/software/libunistring/manual/libunistring.html
