Unicode Locale
What Is a Locale?
A key concept for application programs is that of a program's locale. The locale is an explicit model and definition of a native-language environment. The notion of a locale is defined in the POSIX standard, which is available from http://opengroup.org.
A locale consists of a number of categories for which country-dependent formatting or other specifications exist. A program's locale defines its code sets, date and time formatting conventions, monetary conventions, decimal formatting conventions, and collation (sort) order.
A locale name can be composed of a base language, country (territory) of use, and codeset. For example, the German language is de, an abbreviation for Deutsch, while Swiss German is de_CH, CH being an abbreviation for Confoederatio Helvetica. This convention allows for specific differences by country, such as currency unit notation. In Oracle Solaris 11 the default locale codeset is UTF-8, an ASCII-compatible 8-bit encoding form of Unicode. The fully defined locale name for Swiss German is thus de_CH.UTF-8.
More than one locale can be associated with a particular language, which allows for regional differences. For example, an English-speaking user in the United States can select the en_US.UTF-8 locale (English for the United States), while an English-speaking user in Great Britain can select en_GB.UTF-8 (English for Great Britain).
Generally the locale name is specified by the LANG environment variable. Locale categories are subordinate to LANG but can be set separately, in which case they override LANG. If the LC_ALL environment variable is set, it overrides LANG and all the separate locale categories.
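The precedence among LC_ALL, the individual category variables, and LANG can be sketched in a few lines of Python. This is an illustration only, not how any system library implements it: the function name effective_locale and the fallback to "C" when nothing is set are assumptions made for the example, and empty values are treated the same as unset ones.

```python
# Hypothetical helper: resolve which locale name applies to one category,
# given a mapping of environment variables (for example, os.environ).
def effective_locale(category, env):
    """Return the locale name in effect for a category such as "LC_TIME"."""
    # LC_ALL wins over the category variable, which wins over LANG.
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:  # unset and empty are both treated as "not set"
            return value
    # Assumption for this sketch: with nothing set, fall back to the C locale.
    return "C"


if __name__ == "__main__":
    env = {"LANG": "de_CH.UTF-8", "LC_TIME": "en_GB.UTF-8"}
    for category in ("LC_TIME", "LC_NUMERIC"):
        print(category, "->", effective_locale(category, env))
```

Passing a plain dictionary instead of reading os.environ directly keeps the function easy to test with different combinations of variables.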
The locale naming convention is:
language[_territory][.codeset][@modifier]
where language is a two-letter code from ISO 639, territory is a two-letter code from ISO 3166, codeset is the name of the codeset used in the locale, and modifier names the characteristic that differentiates the locale from the same locale without the modifier.
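To make the naming convention concrete, the following Python sketch splits a locale name into its four parts. The names LocaleName and parse_locale_name are hypothetical, and the regular expression is deliberately permissive (it also accepts names such as C and POSIX); it does not check the codes against ISO 639 or ISO 3166.

```python
import re
from typing import NamedTuple, Optional


class LocaleName(NamedTuple):
    """Parts of a name of the form language[_territory][.codeset][@modifier]."""
    language: str
    territory: Optional[str]
    codeset: Optional[str]
    modifier: Optional[str]


# Each optional part is introduced by its own delimiter: "_", "." or "@".
_LOCALE_RE = re.compile(
    r"^(?P<language>[A-Za-z]+)"
    r"(?:_(?P<territory>[A-Za-z0-9]+))?"
    r"(?:\.(?P<codeset>[^@]+))?"
    r"(?:@(?P<modifier>.+))?$"
)


def parse_locale_name(name):
    """Split a locale name into its parts; missing parts are None."""
    match = _LOCALE_RE.match(name)
    if match is None:
        raise ValueError("not a valid locale name: %r" % (name,))
    return LocaleName(**match.groupdict())


if __name__ == "__main__":
    for name in ("de", "de_CH", "de_CH.UTF-8", "C"):
        print(name, "->", parse_locale_name(name))
```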
All Oracle Solaris product locales preserve the Portable Character Set characters with US-ASCII code values.
For more information about the portable character set, refer to X/Open CAE Specification: System Interface Definitions, Issue 5 (ISBN 1-85912-186-1).
A single locale can have more than one locale name. For example, POSIX is the same locale as C.
C Locale
The C locale, also known as the POSIX locale, is the POSIX system default locale for all POSIX-compliant systems. The Oracle Solaris operating system is a POSIX system. The Single UNIX Specification, Version 3, defines the C locale. You can register at http://www.unix.org/version3/online.html to read and download the specification.
You can run your internationalized programs in the C locale in either of the following two ways:
- Unset all locale environment variables, which runs the application in the C locale.
  $ unset LC_ALL LANG LC_CTYPE LC_COLLATE LC_NUMERIC LC_TIME LC_MONETARY LC_MESSAGES
- Explicitly set the locale to C or POSIX.
  $ export LC_ALL=C
  $ export LANG=C
Some applications check the LANG environment variable directly without calling setlocale(3C) to reference the current locale. For this reason, the shell is explicitly set to the C locale by specifying both the LC_ALL and LANG environment variables. For the precedence relationship among locale environment variables, see the setlocale(3C) man page.
To check the current locale settings in a terminal environment, run the locale(1) command.
$ locale
LANG=C
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
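A program can also select and inspect the C locale from within its own process. The sketch below uses Python's standard locale module, which wraps the C library's setlocale and localeconv calls; the function name numeric_conventions is hypothetical. Only the C locale is used here because it is available on every POSIX system, whereas other locale names may not be installed.

```python
import locale


def numeric_conventions(locale_name="C"):
    """Switch all categories to locale_name and report a few LC_NUMERIC fields."""
    # Raises locale.Error if the named locale is not available on the system.
    locale.setlocale(locale.LC_ALL, locale_name)
    conv = locale.localeconv()
    return {
        # Calling setlocale with only a category queries the current setting.
        "locale": locale.setlocale(locale.LC_NUMERIC),
        "decimal_point": conv["decimal_point"],
        "thousands_sep": conv["thousands_sep"],
    }


if __name__ == "__main__":
    print(numeric_conventions("C"))
```

Note that setlocale changes process-wide state, so a library should generally leave this call to the application that uses it.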
Locale Categories
The types of locale categories are as follows:
- LC_CTYPE
  Character classification and case conversion.
- LC_TIME
  Specifies date and time formats, including month names, days of the week, and common full and abbreviated representations.
- LC_MONETARY
  Specifies monetary formats, including the currency symbol for the locale, thousands separator, sign position, the number of fractional digits, and so forth.
- LC_NUMERIC
  Specifies the decimal delimiter (or radix character), the thousands separator, and the grouping.
- LC_COLLATE
  Specifies a collation order and regular expression definition for the locale.
- LC_MESSAGES
  Specifies the language in which the localized messages are written, and affirmative and negative responses of the locale (yes and no strings and expressions).
- LO_LTYPE
  Specifies the layout engine that provides information about language rendering. Language rendering (or text rendering) depends on the shape and direction attributes of a script.
Core Locales
The following table lists Oracle Solaris 11 core locales:
Table 1-1 Languages and Core locales
Core locales have better coverage of localized messages than the locales available for additional installation. Oracle Solaris OS components such as Installer or Package Manager are localized only in core locales, while localized messages for third-party software such as GNOME or Firefox are often available in more locales.
All locales in the Oracle Solaris environment are capable of displaying localized messages, provided that the localized messages for the relevant language and application are present. Additional locales, including all their available localized messages, can be added to the system from the installation repository by modifying pkg facet properties. For more information, see Installing Additional Locales.
ISO-3166 Country Codes and ISO-639 Language Codes
This chapter contains tables that list the ISO codes. Table 20-1 lists the ISO-3166 Country Codes and Table 20-2 lists the ISO-639 Language Codes.
ISO-3166 Country Codes
Related Topics
An important note for developers of UTF-8 decoding routines: For security reasons, a UTF-8 decoder must not accept UTF-8 sequences that are longer than necessary to encode a character. For example, the character U+000A (line feed) must be accepted from a UTF-8 stream only in the form 0x0A, but not in any of the following five possible overlong forms:
0xC0 0x8A
0xE0 0x80 0x8A
0xF0 0x80 0x80 0x8A
0xF8 0x80 0x80 0x80 0x8A
0xFC 0x80 0x80 0x80 0x80 0x8A
Any overlong UTF-8 sequence could be abused to bypass UTF-8 substring tests that look only for the shortest possible encoding. All overlong UTF-8 sequences start with one of the following byte patterns:
1100000x (10xxxxxx)
11100000 100xxxxx (10xxxxxx)
11110000 1000xxxx (10xxxxxx 10xxxxxx)
11111000 10000xxx (10xxxxxx 10xxxxxx 10xxxxxx)
11111100 100000xx (10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx)
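The five byte patterns above translate directly into a check on the first one or two bytes of a sequence. The following Python sketch is a minimal illustration rather than a complete decoder: the function name starts_overlong is hypothetical, and the function only tests whether a single sequence begins with one of the listed patterns. It does not validate continuation bytes or sequence length.

```python
# Upper bound of the second byte for each lead byte that can begin an
# overlong form, derived from the bit patterns listed above:
#   11100000 100xxxxx -> 0xE0 followed by 0x80..0x9F
#   11110000 1000xxxx -> 0xF0 followed by 0x80..0x8F
#   11111000 10000xxx -> 0xF8 followed by 0x80..0x87
#   11111100 100000xx -> 0xFC followed by 0x80..0x83
_SECOND_BYTE_LIMIT = {0xE0: 0x9F, 0xF0: 0x8F, 0xF8: 0x87, 0xFC: 0x83}


def starts_overlong(seq):
    """Return True if the byte sequence begins with an overlong pattern."""
    if not seq:
        return False
    lead = seq[0]
    # 1100000x: a two-byte form used for a character that fits in one byte.
    if lead in (0xC0, 0xC1):
        return True
    if len(seq) < 2:
        return False
    upper = _SECOND_BYTE_LIMIT.get(lead)
    return upper is not None and 0x80 <= seq[1] <= upper


if __name__ == "__main__":
    samples = (b"\x0a", b"\xc0\x8a", b"\xe0\x80\x8a", b"\xe0\xa0\x80")
    for sample in samples:
        print(sample.hex(), starts_overlong(sample))
```

In practice a decoder would reject these sequences as part of normal validation. Python's built-in UTF-8 codec, for example, raises UnicodeDecodeError for b"\xc0\x8a" instead of returning a line feed.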
Also note that the code positions U+D800 to U+DFFF (UTF-16 surrogates) as well as U+FFFE and U+FFFF must not occur in normal UTF-8 or UCS-4 data. UTF-8 decoders should treat them like malformed or overlong sequences for safety reasons.
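The code position rule can be layered on top of an existing decoder. The sketch below assumes Python's built-in UTF-8 codec, which already rejects overlong forms and encoded surrogates but does accept U+FFFE and U+FFFF, so those two are checked separately to match the stricter rule stated above. The names is_disallowed and decode_strict are hypothetical.

```python
def is_disallowed(code_point):
    """True for UTF-16 surrogates and for U+FFFE and U+FFFF."""
    return 0xD800 <= code_point <= 0xDFFF or code_point in (0xFFFE, 0xFFFF)


def decode_strict(data):
    """Decode UTF-8 bytes, rejecting the code positions named above."""
    # The built-in codec raises UnicodeDecodeError (a ValueError subclass)
    # for malformed input, overlong forms, and encoded surrogates.
    text = data.decode("utf-8")
    for char in text:
        if is_disallowed(ord(char)):
            raise ValueError("disallowed code point U+%04X" % ord(char))
    return text


if __name__ == "__main__":
    print(decode_strict("Z\u00fcrich".encode("utf-8")))
    for bad in (b"\xed\xa0\x80", b"\xef\xbf\xbf"):
        try:
            decode_strict(bad)
        except ValueError as error:
            print("rejected", bad.hex(), "-", error)
```

Because both failure modes surface as ValueError, a caller can handle malformed and disallowed input with a single except clause.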