Delphi and Character Encoding (Practical Part): MultiByteToWideChar returns the length, in wide characters, of the converted string
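Goals of this article:
- Understand Delphi's string types
- Detect and convert character encodings
- Convert between Simplified and Traditional Chinese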
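0. Introduction
After reading ".NET and Character Encoding (Theory)", we know that the character is the smallest unit of written natural language, and that three families of encodings are used to store and transmit characters: ASCII, DBCS, and Unicode. Common DBCS encodings include GB2312, GBK, and BIG5, while UTF-8, UTF-16, and UTF-32 are the most widely used Unicode encodings.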
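1. String Types
Delphi has two main dynamically allocated string types: AnsiString and WideString. AnsiString is known as the "long string"; WideString is the "wide string" (Unicode string) and is compatible with the COM string type (BSTR). Both are allocated on the heap, and their allocation and release are managed automatically. On the Win32 platform, in Delphi versions before Delphi 2009, the string type is equivalent to AnsiString. An AnsiString can also be treated as a plain sequence of bytes, so it can hold single-byte encodings (SBCS), multi-byte encodings (MBCS/DBCS), and UTF-8. WideString uses UTF-16, so it can represent any Unicode character, although a character outside the Basic Multilingual Plane occupies two WideChar elements (a surrogate pair).
To illustrate the difference between characters and bytes, here is an example that counts characters: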
// Assume the current system code page is CP936 (GBK 1.0)
procedure TestAnsiLength;
var
  str: string;
begin
  str := '汉字ABC';
  Assert(Length(str) = 7);     // 7 bytes
  Assert(AnsiLength(str) = 5); // 5 characters
end;
Here are two alternative implementations of AnsiLength:
// uses SysUtils;
// Implementation 1: walk the bytes and skip the trail byte after each lead byte.
function AnsiLength(const s: AnsiString): Integer;
var
  p, q: PAnsiChar;
begin
  Result := 0;
  p := PAnsiChar(s);
  q := p + Length(s);
  while p < q do
  begin
    Inc(Result);
    if p^ in LeadBytes then // LeadBytes: the set of lead bytes for the current system code page
      Inc(p, 2)
    else
      Inc(p);
  end;
end;

// uses Windows;
// Implementation 2: ask Windows how many wide characters the conversion would produce.
function AnsiLength(const s: AnsiString): Integer;
begin
  // With cbMultiByte = -1 the returned count includes the terminating null.
  Result := MultiByteToWideChar(CP_ACP, 0, PAnsiChar(s), -1, nil, 0);
  if Result > 0 then Dec(Result); // exclude the terminating null
end;
If you have understood the encoding concepts in ".NET and Character Encoding (Theory)", the example above is straightforward.
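The first AnsiLength implementation depends on the system code page through LeadBytes. As a complement, here is a minimal sketch that assumes the input is GBK (CP936) and hard-codes its lead-byte range ($81..$FE), so the count does not vary with the machine's locale. GbkCharCount is a hypothetical name used for illustration; it is not part of the original article or of the RTL.

```pascal
// Hypothetical helper: counts the characters in a GBK (CP936) byte string
// without consulting the system code page.
function GbkCharCount(const s: AnsiString): Integer;
var
  i, n: Integer;
begin
  Result := 0;
  n := Length(s);
  i := 1;
  while i <= n do
  begin
    Inc(Result);
    // Assumption: in GBK, a byte in $81..$FE starts a two-byte character.
    // A lead byte at the very end (truncated input) is counted as one character.
    if (Ord(s[i]) >= $81) and (Ord(s[i]) <= $FE) and (i < n) then
      Inc(i, 2)
    else
      Inc(i);
  end;
end;
```

For the GBK bytes BA BA D7 D6 41 42 43 (the text 汉字ABC), the loop advances two bytes twice and one byte three times, so it counts 5 characters out of 7 bytes.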
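2. Detecting and Converting Character Encodings
As the saying goes, "to do a good job, one must first sharpen one's tools", so let me recommend some tools first:
Define the basic types: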
type
  { Encoding type }
  TEncodingType = (
    etAnsi,       // ANSI format (SBCS/DBCS)
    etUTF8,       // UTF-8 format
    etUnicode,    // UTF-16 format using little endian
    etUnicodeBE,  // UTF-16 format using big endian
    etUTF32,      // UTF-32 format using little endian
    etUTF32BE     // UTF-32 format using big endian
  );

  { Byte order mark }
  TByteOrderMask = array of Byte;
Get the BOM for each encoding type, using a helper named CopyBytes:
function TryGetBOM(const encodingType: TEncodingType; var bom: TByteOrderMask): Boolean;
begin
  Result := True;
  case encodingType of
    etUTF8:      CopyBytes(BOM_UTF8, bom);
    etUnicode:   CopyBytes(BOM_UTF16_LSB, bom);
    etUnicodeBE: CopyBytes(BOM_UTF16_MSB, bom);
    etUTF32:     CopyBytes(BOM_UTF32_LSB, bom);
    etUTF32BE:   CopyBytes(BOM_UTF32_MSB, bom);
    else
    begin
      // ANSI text has no byte order mark.
      SetLength(bom, 0);
      Result := False;
    end;
  end;
end;
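TryGetBOM relies on a CopyBytes helper and a set of BOM_* constants whose definitions do not appear in the listing. The following is one possible sketch, not the original author's code: the byte values are the standard Unicode byte order marks, while the constant types and the CopyBytes signature are assumptions chosen to fit the calls above.

```pascal
type
  // Repeated here so that the snippet stands alone.
  TByteOrderMask = array of Byte;

const
  // Standard Unicode byte order marks. Declaring them as typed byte
  // arrays is an assumption; the original declarations are not shown.
  BOM_UTF8:      array[0..2] of Byte = ($EF, $BB, $BF);
  BOM_UTF16_LSB: array[0..1] of Byte = ($FF, $FE);
  BOM_UTF16_MSB: array[0..1] of Byte = ($FE, $FF);
  BOM_UTF32_LSB: array[0..3] of Byte = ($FF, $FE, $00, $00);
  BOM_UTF32_MSB: array[0..3] of Byte = ($00, $00, $FE, $FF);

// Hypothetical helper: copies a fixed-size BOM constant into a dynamic array.
procedure CopyBytes(const source: array of Byte; var dest: TByteOrderMask);
begin
  SetLength(dest, Length(source));
  if Length(source) > 0 then
    Move(source[0], dest[0], Length(source));
end;
```

Taking the source as an open array parameter lets the same procedure accept constants of different lengths.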
Detect the encoding type, using a helper named CompareBOM:
// uses Windows, Classes;
function DetectEncoding(buffer: PAnsiChar): TEncodingType; overload;
begin
  // The UTF-32 LE mark (FF FE 00 00) starts with the UTF-16 LE mark (FF FE),
  // so the UTF-32 marks must be tested before the UTF-16 marks.
  if CompareBOM(buffer, BOM_UTF8) then
    Result := etUTF8
  else if CompareBOM(buffer, BOM_UTF32_LSB) then
    Result := etUTF32
  else if CompareBOM(buffer, BOM_UTF32_MSB) then
    Result := etUTF32BE
  else if CompareBOM(buffer, BOM_UTF16_LSB) then
    Result := etUnicode
  else if CompareBOM(buffer, BOM_UTF16_MSB) then
    Result := etUnicodeBE
  else
    Result := etAnsi;
end;

function DetectEncoding(stream: TStream): TEncodingType; overload;
var
  pos: Int64;
  bytes: TByteOrderMask;
begin
  SetLength(bytes, 6);
  ZeroMemory(@bytes[0], Length(bytes));
  pos := stream.Seek(0, soFromCurrent); // remember the current position
  try
    stream.Seek(0, soFromBeginning);
    // Length, not SizeOf: SizeOf of a dynamic array is the size of a pointer.
    stream.Read(bytes[0], Length(bytes));
  finally
    stream.Seek(pos, soFromBeginning);  // restore the original position
  end;
  Result := DetectEncoding(PAnsiChar(@bytes[0]));
end;
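DetectEncoding calls a CompareBOM helper that is not shown in the listing. A minimal sketch of what such a helper could look like follows. The signature is an assumption (an open array of bytes for the mark), and the caller is assumed to guarantee that the buffer holds at least as many bytes as the mark being tested.

```pascal
// Hypothetical helper: returns True if buffer starts with the given BOM bytes.
// Assumption: buffer points to at least Length(bom) readable bytes.
function CompareBOM(buffer: PAnsiChar; const bom: array of Byte): Boolean;
var
  i: Integer;
begin
  Result := (buffer <> nil) and (Length(bom) > 0);
  if not Result then
    Exit;
  for i := 0 to High(bom) do
    if Ord(buffer[i]) <> bom[i] then
    begin
      Result := False;
      Exit;
    end;
end;
```

Whatever form the helper takes, the UTF-32 little-endian mark (FF FE 00 00) begins with the UTF-16 little-endian mark (FF FE), so a detector should test the longer mark first.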
The following procedure demonstrates how to save text using different encodings:
// uses SysUtils, Classes;
procedure WriteText(stream: TStream; const buffer: WideString;
  const encodingType: TEncodingType; withBom: Boolean = False);
var
  s: AnsiString;
  w: WideString;
  p: PAnsiChar;
  bom: TByteOrderMask;
  bytes: Integer;
begin
  p := nil;
  bytes := Length(buffer) * SizeOf(WideChar); // default: size of the UTF-16 data
  case encodingType of
    etAnsi:
      begin
        s := AnsiString(buffer); // convert to the system ANSI code page first
        p := PAnsiChar(s);
        bytes := Length(s);
      end;
    etUTF8:
      begin
        s := Utf8Encode(buffer);
        p := PAnsiChar(s);
        bytes := Length(s);
      end;
    etUnicode:
      begin
        p := PAnsiChar(PWideChar(buffer));
      end;
    etUnicodeBE:
      begin
        w := buffer; // work on a copy so the caller's string is not modified
        // StrSwapByteOrder is not an RTL routine; it is assumed to come
        // from a separate Unicode helper unit.
        StrSwapByteOrder(PWideChar(w));
        p := PAnsiChar(PWideChar(w));
      end;
    else // left for the reader to implement
      begin
        raise Exception.Create('Not Implemented.');
      end;
  end;
  // Write the BOM only once the encoding is known to be supported.
  if withBom and TryGetBOM(encodingType, bom) then
    stream.Write(bom[0], Length(bom));
  if bytes > 0 then
    stream.Write(p^, bytes);
end;
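WriteText leaves the UTF-32 cases to the reader and uses a byte-swapping routine for big-endian UTF-16 that is not defined in the listing. The sketch below shows one way both steps might be written in plain Pascal. SwapUtf16ByteOrder, Utf16ToUtf32, and TUcs4Array are hypothetical names introduced here for illustration, not routines from the original article or the RTL.

```pascal
type
  TUcs4Array = array of LongWord; // one element per Unicode code point

// Hypothetical helper: returns a copy of s in which the two bytes of every
// UTF-16 code unit are exchanged (little endian <-> big endian).
function SwapUtf16ByteOrder(const s: WideString): WideString;
var
  i: Integer;
  u: Word;
begin
  Result := s;
  for i := 1 to Length(Result) do
  begin
    u := Ord(Result[i]);
    Result[i] := WideChar((u shr 8) or ((u and $FF) shl 8));
  end;
end;

// Hypothetical helper: decodes UTF-16 code units into UTF-32 code points.
function Utf16ToUtf32(const s: WideString): TUcs4Array;
var
  i, n, count: Integer;
  w1, w2: Word;
begin
  n := Length(s);
  SetLength(Result, n); // upper bound: at most one code point per code unit
  count := 0;
  i := 1;
  while i <= n do
  begin
    w1 := Ord(s[i]);
    Inc(i);
    Result[count] := w1; // BMP character, or an unpaired surrogate copied as-is
    if (w1 >= $D800) and (w1 <= $DBFF) and (i <= n) then
    begin
      w2 := Ord(s[i]);
      if (w2 >= $DC00) and (w2 <= $DFFF) then
      begin
        // Combine a surrogate pair into one code point above U+FFFF.
        Result[count] := $10000 + ((LongWord(w1) - $D800) shl 10)
          + (LongWord(w2) - $DC00);
        Inc(i);
      end;
    end;
    Inc(count);
  end;
  SetLength(Result, count);
end;
```

On a little-endian CPU such as x86, the elements of a non-empty result array are already laid out as UTF-32 LE, so they could be written with stream.Write(cp[0], Length(cp) * SizeOf(LongWord)). For example, the surrogate pair D83D DE00 decodes to the single code point $1F600.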
Note that the structure would be clearer if these routines were wrapped in classes.
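3. Simplified/Traditional Chinese Conversion
Simplified/Traditional conversion covers two directions: Simplified to Traditional and Traditional to Simplified. Both work by looking up each character in a character-code mapping table to find its counterpart. A unit circulating online, described as "a unit that performs internal-code conversion and Simplified/Traditional conversion using code mapping tables", is built on this principle, so it is not covered in detail here.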
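To make the table-lookup idea concrete, here is a minimal sketch that maps Simplified characters to Traditional ones in a WideString. The three-entry table, the record type, and the function name are all illustrative assumptions. A real table holds thousands of entries and would normally be searched with a sorted array or a hash rather than a linear scan.

```pascal
type
  TCharPair = record
    Simplified, Traditional: Word; // UTF-16 code units
  end;

const
  // Tiny sample table for illustration only.
  SampleMap: array[0..2] of TCharPair = (
    (Simplified: $6C49; Traditional: $6F22),  // 汉 -> 漢
    (Simplified: $56FD; Traditional: $570B),  // 国 -> 國
    (Simplified: $4F53; Traditional: $9AD4)   // 体 -> 體
  );

// Hypothetical helper: replaces every character found in SampleMap with its
// Traditional counterpart and leaves all other characters unchanged.
function ToTraditional(const s: WideString): WideString;
var
  i, j: Integer;
begin
  Result := s;
  for i := 1 to Length(Result) do
    for j := Low(SampleMap) to High(SampleMap) do
      if Ord(Result[i]) = SampleMap[j].Simplified then
      begin
        Result[i] := WideChar(SampleMap[j].Traditional);
        Break;
      end;
end;
```

A character-by-character table like this cannot resolve one-to-many cases, where one Simplified character corresponds to several Traditional characters depending on the word (发 can be 發 or 髮), so its output should be treated as an approximation. On Windows, LCMapStringW with the LCMAP_TRADITIONAL_CHINESE or LCMAP_SIMPLIFIED_CHINESE flag offers a comparable character-level mapping.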
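{ TODO: Wrap the character encoding module in classes (OOP) and make it available for download }
{ TODO: Look into Simplified/Traditional Chinese conversion }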
References
http://www.cnblogs.com/baoquan/articles/1027371.html