Unicode Locale

What Is a Locale?

A key concept for application programs is that of a program's locale. The locale is an explicit model and definition of a native-language environment. The notion of a locale is explicitly defined in the POSIX standard, which is available from The Open Group at http://opengroup.org.

A locale consists of a number of categories for which country-dependent formatting or other specifications exist. A program's locale defines its code sets, date and time formatting conventions, monetary conventions, decimal formatting conventions, and collation (sort) order.

A locale name can be composed of a base language, a country (territory) of use, and a codeset. For example, the German language is de, an abbreviation for Deutsch, while Swiss German is de_CH, with CH being an abbreviation for Confoederatio Helvetica. This convention allows for specific differences by country, such as currency unit notation. In Oracle Solaris 11, the default locale codeset is UTF-8, an ASCII-compatible 8-bit encoding form of Unicode. The fully defined locale name for Swiss German is thus de_CH.UTF-8.

More than one locale can be associated with a particular language, which allows for regional differences. For example, an English-speaking user in the United States can select the en_US.UTF-8 locale (English for the United States), while an English-speaking user in Great Britain can select en_GB.UTF-8 (English for Great Britain).

Generally, the locale name is specified by the LANG environment variable. The individual locale categories are subordinate to LANG but can be set separately, in which case each one overrides LANG for its own category. If the LC_ALL environment variable is set, it overrides LANG and all the separate locale categories.
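The precedence described above can be sketched in a few lines of Python. This is only an illustration, not how setlocale(3C) is implemented: the function name effective_locale is hypothetical, empty values are treated as unset, and falling back to C when nothing is set is an assumption made for the example.

```python
def effective_locale(category: str, env: dict) -> str:
    """Return the locale name that applies to one category.

    Precedence, highest first: LC_ALL, the category variable itself
    (for example LC_TIME), then LANG. Empty values count as unset.
    """
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    # Assumption for this sketch: with nothing set, the C locale applies.
    return "C"


if __name__ == "__main__":
    env = {"LANG": "en_US.UTF-8", "LC_TIME": "de_CH.UTF-8"}
    print(effective_locale("LC_TIME", env))     # category overrides LANG
    print(effective_locale("LC_NUMERIC", env))  # falls back to LANG
```

Passing a plain dictionary instead of reading os.environ directly keeps the lookup easy to try with different combinations of variables.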

The locale naming convention is:

language[_territory][.codeset][@modifier]

where language is a two-letter code from ISO 639, territory is a two-letter code from ISO 3166, codeset is the name of the codeset that is used in the locale, and modifier names the characteristic that differentiates the locale from the same locale without the modifier.
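As a rough illustration of this naming convention, the following Python sketch splits a locale name into its four parts. The names LocaleName and parse_locale_name are hypothetical, and the pattern follows only the rules stated above (two-letter language and territory codes), so special names such as C and POSIX deliberately do not match.

```python
import re
from typing import NamedTuple, Optional


class LocaleName(NamedTuple):
    language: str
    territory: Optional[str]
    codeset: Optional[str]
    modifier: Optional[str]


# language[_territory][.codeset][@modifier]
_LOCALE_RE = re.compile(
    r"^(?P<language>[a-z]{2})"          # ISO 639 language code
    r"(?:_(?P<territory>[A-Z]{2}))?"    # ISO 3166 territory code
    r"(?:\.(?P<codeset>[^@]+))?"        # codeset name, for example UTF-8
    r"(?:@(?P<modifier>.+))?$"          # modifier
)


def parse_locale_name(name: str) -> Optional[LocaleName]:
    """Split a locale name into its parts, or return None if it does not fit."""
    match = _LOCALE_RE.match(name)
    if match is None:
        return None
    return LocaleName(**match.groupdict())


if __name__ == "__main__":
    print(parse_locale_name("de_CH.UTF-8"))
    print(parse_locale_name("C"))
```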

All Oracle Solaris product locales preserve the Portable Character Set characters with US-ASCII code values.

For more information about the Portable Character Set, refer to the X/Open CAE Specification: System Interface Definitions, Issue 5 (ISBN 1-85912-186-1).

A single locale can have more than one locale name. For example, POSIX is the same locale as C.

C Locale

The C locale, also known as the POSIX locale, is the POSIX system default locale for all POSIX-compliant systems. The Oracle Solaris operating system is a POSIX system. The Single UNIX Specification, Version 3, defines the C locale. You can register at http://www.unix.org/version3/online.html to read and download the specification.

You can specify your internationalized programs to run in the C locale in the following two ways:

  • Unset all locale environment variables. The application then runs in the C locale.

    $ unset LC_ALL LANG LC_CTYPE LC_COLLATE LC_NUMERIC LC_TIME LC_MONETARY LC_MESSAGES

  • Explicitly set the locale to C or POSIX.

    $ export LC_ALL=C
    $ export LANG=C

    Some applications check the LANG environment variable without actually calling setlocale(3C) to reference the current locale. Setting both LC_ALL and LANG, as shown here, explicitly puts the shell in the C locale for those applications as well. For the precedence relationship among locale environment variables, see the setlocale(3C) man page.
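When a program is launched from another program rather than from a shell, the same two approaches can be applied to the child's environment. The following Python sketch assumes hypothetical helper names (unset_locale_env and c_locale_env) and only builds the environment dictionaries; it does not start any process itself.

```python
# The variables named in the unset command above.
LOCALE_VARS = (
    "LC_ALL", "LANG", "LC_CTYPE", "LC_COLLATE",
    "LC_NUMERIC", "LC_TIME", "LC_MONETARY", "LC_MESSAGES",
)


def unset_locale_env(env: dict) -> dict:
    """First approach: return a copy of env without any locale variables."""
    return {key: value for key, value in env.items() if key not in LOCALE_VARS}


def c_locale_env(env: dict) -> dict:
    """Second approach: return a copy of env with LC_ALL and LANG set to C."""
    cleaned = unset_locale_env(env)
    cleaned["LC_ALL"] = "C"
    cleaned["LANG"] = "C"
    return cleaned


if __name__ == "__main__":
    import os
    print(c_locale_env(dict(os.environ))["LANG"])
```

The returned dictionary can be passed as the env argument of subprocess.run so that only the child process is affected and the parent keeps its own locale.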

To check the current locale settings in a terminal environment, run the locale(1) command.

$ locale
LANG=C
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=

Locale Categories

The types of locale categories are as follows:

LC_CTYPE

Specifies character classification and case conversion.

LC_TIME

Specifies date and time formats, including month names, days of the week, and common full and abbreviated representations.

LC_MONETARY

Specifies monetary formats, including the currency symbol for the locale, thousands separator, sign position, the number of fractional digits, and so forth.

LC_NUMERIC

Specifies the decimal delimiter (or radix character), the thousands separator, and the grouping.

LC_COLLATE

Specifies a collation order and regular expression definition for the locale.

LC_MESSAGES

Specifies the language in which the localized messages are written, and affirmative and negative responses of the locale (yes and no strings and expressions).

LO_LTYPE

Specifies the layout engine that provides information about language rendering. Language rendering (or text rendering) depends on the shape and direction attributes of a script.
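To see what some of these categories control, the following Python sketch switches the process to the C locale and reads back a few of its conventions through the standard locale and time modules. The function name c_locale_snapshot is hypothetical, and only the LC_NUMERIC, LC_MONETARY, and LC_TIME categories are touched; other categories, such as LO_LTYPE, are not exposed by these modules.

```python
import locale
import time


def c_locale_snapshot() -> dict:
    """Switch to the C locale and collect a few of its conventions."""
    locale.setlocale(locale.LC_ALL, "C")
    conv = locale.localeconv()
    return {
        # LC_NUMERIC: radix character and thousands separator
        "decimal_point": conv["decimal_point"],
        "thousands_sep": conv["thousands_sep"],
        # LC_MONETARY: currency symbol for the locale
        "currency_symbol": conv["currency_symbol"],
        # LC_TIME: full weekday name for 1970-01-01 (the epoch)
        "weekday": time.strftime("%A", time.gmtime(0)),
    }


if __name__ == "__main__":
    for key, value in c_locale_snapshot().items():
        print(f"{key}: {value!r}")
```

Replacing "C" with another installed locale name shows how the same fields change, provided that locale is actually available on the system.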

Core Locales

The following table lists Oracle Solaris 11 core locales:

Table 1-1 Languages and Core locales

Language                Core locale
Chinese - Simplified    zh_CN.UTF-8
Chinese - Traditional   zh_TW.UTF-8
English                 en_US.UTF-8
French                  fr_FR.UTF-8
German                  de_DE.UTF-8
Italian                 it_IT.UTF-8
Japanese                ja_JP.UTF-8
Korean                  ko_KR.UTF-8
Portuguese - Brazilian  pt_BR.UTF-8
Spanish                 es_ES.UTF-8

Core locales have better coverage at the level of localized messages than the locales available for additional installation. Oracle Solaris OS components such as Installer or Package Manager are localized only in core locales while localized messages for third-party software such as GNOME or Firefox are often available in more locales.

All locales in the Oracle Solaris environment are capable of displaying localized messages, provided that the localized messages for the relevant language and application are present. Additional locales including all their available localized messages can be added to the system from the installation repository by modification of pkg facet properties. For more information, see Installing Additional Locales.

ISO-3166 Country Codes and ISO-639 Language Codes

This chapter contains tables that list ISO codes. Table 20-1 lists the ISO-3166 country codes, and Table 20-2 lists the ISO-639 language codes.

ISO-3166 Country Codes

Table 20-1 ISO-3166 Country Codes

Code  Country
AF    AFGHANISTAN
AL    ALBANIA
DZ    ALGERIA
AS    AMERICAN SAMOA
AD    ANDORRA
AO    ANGOLA
AQ    ANTARCTICA
AG    ANTIGUA AND BARBUDA
AR    ARGENTINA
AM    ARMENIA
AW    ARUBA
AU    AUSTRALIA
AT    AUSTRIA
AZ    AZERBAIJAN
BS    BAHAMAS
BH    BAHRAIN
BD    BANGLADESH
BB    BARBADOS
BY    BELARUS
BE    BELGIUM
BZ    BELIZE
BJ    BENIN
BM    BERMUDA
BT    BHUTAN
BO    BOLIVIA
BA    BOSNIA AND HERZEGOVINA
BW    BOTSWANA
BV    BOUVET ISLAND
BR    BRAZIL
IO    BRITISH INDIAN OCEAN TERRITORY
BN    BRUNEI DARUSSALAM
BG    BULGARIA
BF    BURKINA FASO
BI    BURUNDI
KH    CAMBODIA
CM    CAMEROON
CA    CANADA
CV    CAPE VERDE
KY    CAYMAN ISLANDS
CF    CENTRAL AFRICAN REPUBLIC
TD    CHAD
CL    CHILE
CN    CHINA
CX    CHRISTMAS ISLAND
CC    COCOS (KEELING) ISLANDS
CO    COLOMBIA
KM    COMOROS
CG    CONGO
CD    CONGO, THE DEMOCRATIC REPUBLIC OF THE
CK    COOK ISLANDS
CR    COSTA RICA
CI    CÔTE D'IVOIRE
HR    CROATIA
CU    CUBA
CY    CYPRUS
CZ    CZECH REPUBLIC
DK    DENMARK
DJ    DJIBOUTI
DM    DOMINICA
DO    DOMINICAN REPUBLIC
EC    ECUADOR
EG    EGYPT
SV    EL SALVADOR
GQ    EQUATORIAL GUINEA
ER    ERITREA
EE    ESTONIA
ET    ETHIOPIA
FK    FALKLAND ISLANDS (MALVINAS)
FO    FAROE ISLANDS
FJ    FIJI
FI    FINLAND
FR    FRANCE
GF    FRENCH GUIANA
PF    FRENCH POLYNESIA
TF    FRENCH SOUTHERN TERRITORIES
GA    GABON
GM    GAMBIA
GE    GEORGIA
DE    GERMANY
GH    GHANA
GI    GIBRALTAR
GR    GREECE
GL    GREENLAND
GD    GRENADA
GP    GUADELOUPE
GU    GUAM
GT    GUATEMALA
GN    GUINEA
GW    GUINEA-BISSAU
GY    GUYANA
HT    HAITI
HM    HEARD ISLAND AND MCDONALD ISLANDS
HN    HONDURAS
HK    HONG KONG
HU    HUNGARY
IS    ICELAND
IN    INDIA
ID    INDONESIA
IR    IRAN, ISLAMIC REPUBLIC OF
IQ    IRAQ
IE    IRELAND
IL    ISRAEL
IT    ITALY
JM    JAMAICA
JP    JAPAN
JO    JORDAN
KZ    KAZAKHSTAN
KE    KENYA
KI    KIRIBATI
KP    KOREA, DEMOCRATIC PEOPLE'S REPUBLIC OF
KR    KOREA, REPUBLIC OF
KW    KUWAIT
KG    KYRGYZSTAN
LA    LAO PEOPLE'S DEMOCRATIC REPUBLIC
LV    LATVIA
LB    LEBANON
LS    LESOTHO
LR    LIBERIA
LY    LIBYAN ARAB JAMAHIRIYA
LI    LIECHTENSTEIN
LT    LITHUANIA
LU    LUXEMBOURG
MO    MACAO
MK    MACEDONIA, THE FORMER YUGOSLAV REPUBLIC OF
MG    MADAGASCAR
MW    MALAWI
MY    MALAYSIA
MV    MALDIVES
ML    MALI
MT    MALTA
MH    MARSHALL ISLANDS
MQ    MARTINIQUE
MR    MAURITANIA
MU    MAURITIUS
YT    MAYOTTE
MX    MEXICO
FM    MICRONESIA, FEDERATED STATES OF
MD    MOLDOVA, REPUBLIC OF
MC    MONACO
MN    MONGOLIA
MS    MONTSERRAT
MA    MOROCCO
MZ    MOZAMBIQUE
MM    MYANMAR
NA    NAMIBIA
NR    NAURU
NP    NEPAL
NL    NETHERLANDS
AN    NETHERLANDS ANTILLES
NC    NEW CALEDONIA
NZ    NEW ZEALAND
NI    NICARAGUA
NE    NIGER
NG    NIGERIA
NU    NIUE
NF    NORFOLK ISLAND
MP    NORTHERN MARIANA ISLANDS
NO    NORWAY
OM    OMAN
PK    PAKISTAN
PW    PALAU
PS    PALESTINIAN TERRITORY, OCCUPIED
PA    PANAMA
PG    PAPUA NEW GUINEA
PY    PARAGUAY
PE    PERU
PH    PHILIPPINES
PN    PITCAIRN
PL    POLAND
PR    PUERTO RICO
QA    QATAR
RE    RÉUNION
RO    ROMANIA
RU    RUSSIAN FEDERATION
RW    RWANDA
SH    SAINT HELENA
KN    SAINT KITTS AND NEVIS
LC    SAINT LUCIA
PM    SAINT PIERRE AND MIQUELON
VC    SAINT VINCENT AND THE GRENADINES
WS    SAMOA
SM    SAN MARINO
ST    SAO TOME AND PRINCIPE
SA    SAUDI ARABIA
SN    SENEGAL
CS    SERBIA AND MONTENEGRO
SC    SEYCHELLES
SL    SIERRA LEONE
SG    SINGAPORE
SK    SLOVAKIA
SI    SLOVENIA
SB    SOLOMON ISLANDS
SO    SOMALIA
ZA    SOUTH AFRICA
GS    SOUTH GEORGIA AND THE SOUTH SANDWICH ISLANDS
ES    SPAIN
LK    SRI LANKA
SD    SUDAN
SR    SURINAME
SJ    SVALBARD AND JAN MAYEN
SZ    SWAZILAND
SE    SWEDEN
CH    SWITZERLAND
SY    SYRIAN ARAB REPUBLIC
TW    TAIWAN, PROVINCE OF CHINA
TJ    TAJIKISTAN
TZ    TANZANIA, UNITED REPUBLIC OF
TH    THAILAND
TL    TIMOR-LESTE
TG    TOGO
TK    TOKELAU
TO    TONGA
TT    TRINIDAD AND TOBAGO
TN    TUNISIA
TR    TURKEY
TM    TURKMENISTAN
TC    TURKS AND CAICOS ISLANDS
TV    TUVALU
UG    UGANDA
UA    UKRAINE
AE    UNITED ARAB EMIRATES
GB    UNITED KINGDOM
US    UNITED STATES
UM    UNITED STATES MINOR OUTLYING ISLANDS
UY    URUGUAY
UZ    UZBEKISTAN
VU    VANUATU
VE    VENEZUELA
VN    VIET NAM
VG    VIRGIN ISLANDS, BRITISH
VI    VIRGIN ISLANDS, U.S.
WF    WALLIS AND FUTUNA
EH    WESTERN SAHARA
YE    YEMEN
ZM    ZAMBIA
ZW    ZIMBABWE

 

ISO-639 Language Codes

Table 20-2 ISO-639 Language Codes

Code  Language
ab    Abkhazian
aa    Afar
af    Afrikaans
sq    Albanian
am    Amharic
ar    Arabic
hy    Armenian
as    Assamese
ay    Aymara
az    Azerbaijani
ba    Bashkir
eu    Basque
bn    Bengali (Bangla)
dz    Bhutani
bh    Bihari
bi    Bislama
br    Breton
bg    Bulgarian
my    Burmese
be    Byelorussian (Belarusian)
km    Cambodian
ca    Catalan
zh    Chinese (Simplified)
zh    Chinese (Traditional)
co    Corsican
hr    Croatian
cs    Czech
da    Danish
nl    Dutch
en    English
eo    Esperanto
et    Estonian
fo    Faeroese
fa    Farsi
fj    Fiji
fi    Finnish
fr    French
fy    Frisian
gl    Galician
gd    Gaelic (Scottish)
gv    Gaelic (Manx)
ka    Georgian
de    German
el    Greek
kl    Greenlandic
gn    Guarani
gu    Gujarati
ha    Hausa
he    Hebrew
hi    Hindi
hu    Hungarian
is    Icelandic
id    Indonesian
ia    Interlingua
ie    Interlingue
iu    Inuktitut
ik    Inupiak
ga    Irish
it    Italian
ja    Japanese
jv    Javanese
kn    Kannada
ks    Kashmiri
kk    Kazakh
rw    Kinyarwanda (Ruanda)
ky    Kirghiz
rn    Kirundi (Rundi)
ko    Korean
ku    Kurdish
lo    Laothian
la    Latin
lv    Latvian (Lettish)
li    Limburgish (Limburger)
ln    Lingala
lt    Lithuanian
mk    Macedonian
mg    Malagasy
ms    Malay
ml    Malayalam
mt    Maltese
mi    Maori
mr    Marathi
mo    Moldavian
mn    Mongolian
na    Nauru
ne    Nepali
no    Norwegian
oc    Occitan
or    Oriya
om    Oromo (Afan, Galla)
ps    Pashto (Pushto)
pl    Polish
pt    Portuguese
pa    Punjabi
qu    Quechua
rm    Rhaeto-Romance
ro    Romanian
ru    Russian
sm    Samoan
sg    Sango
sa    Sanskrit
sr    Serbian
sh    Serbo-Croatian
st    Sesotho
tn    Setswana
sn    Shona
sd    Sindhi
si    Sinhalese
ss    Siswati
sk    Slovak
sl    Slovenian
so    Somali
es    Spanish
su    Sundanese
sw    Swahili (Kiswahili)
sv    Swedish
tl    Tagalog
tg    Tajik
ta    Tamil
tt    Tatar
te    Telugu
th    Thai
bo    Tibetan
ti    Tigrinya
to    Tonga
ts    Tsonga
tr    Turkish
tk    Turkmen
tw    Twi
ug    Uighur
uk    Ukrainian
ur    Urdu
uz    Uzbek
vi    Vietnamese
vo    Volapük
cy    Welsh
wo    Wolof
xh    Xhosa
yi    Yiddish
yo    Yoruba
zu    Zulu

 

An important note for developers of UTF-8 decoding routines: For security reasons, a UTF-8 decoder must not accept UTF-8 sequences that are longer than necessary to encode a character. For example, the character U+000A (line feed) must be accepted from a UTF-8 stream only in the form 0x0A, but not in any of the following five possible overlong forms:

  0xC0 0x8A
  0xE0 0x80 0x8A
  0xF0 0x80 0x80 0x8A
  0xF8 0x80 0x80 0x80 0x8A
  0xFC 0x80 0x80 0x80 0x80 0x8A

Any overlong UTF-8 sequence could be abused to bypass UTF-8 substring tests that look only for the shortest possible encoding. All overlong UTF-8 sequences start with one of the following byte patterns:

1100000x (10xxxxxx)
11100000 100xxxxx (10xxxxxx)
11110000 1000xxxx (10xxxxxx 10xxxxxx)
11111000 10000xxx (10xxxxxx 10xxxxxx 10xxxxxx)
11111100 100000xx (10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx)

Also note that the code positions U+D800 to U+DFFF (UTF-16 surrogates) as well as U+FFFE and U+FFFF must not occur in normal UTF-8 or UCS-4 data. UTF-8 decoders should treat them like malformed or overlong sequences for safety reasons.
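The checks described in this note can be sketched as a small hand-written decoder in Python. This is a minimal illustration rather than a production decoder: the name decode_strict is hypothetical, and the sketch rejects the 5-byte and 6-byte lead bytes (0xF8 and above) outright as an assumption based on current UTF-8, which is limited to four bytes per character.

```python
def decode_strict(data: bytes) -> str:
    """Decode UTF-8, rejecting overlong forms, surrogates, U+FFFE and U+FFFF."""
    out = []
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        # Sequence length, payload bits of the lead byte, and the smallest
        # code point that legitimately needs this many bytes.
        if lead < 0x80:
            length, cp, minimum = 1, lead, 0x0
        elif 0xC0 <= lead <= 0xDF:
            length, cp, minimum = 2, lead & 0x1F, 0x80
        elif 0xE0 <= lead <= 0xEF:
            length, cp, minimum = 3, lead & 0x0F, 0x800
        elif 0xF0 <= lead <= 0xF7:
            length, cp, minimum = 4, lead & 0x07, 0x10000
        else:
            # Stray continuation byte, or a 5/6-byte lead (assumed invalid here).
            raise ValueError(f"invalid lead byte 0x{lead:02X} at offset {i}")
        if i + length > n:
            raise ValueError(f"truncated sequence at offset {i}")
        for byte in data[i + 1:i + length]:
            if byte & 0xC0 != 0x80:
                raise ValueError(f"invalid continuation byte at offset {i}")
            cp = (cp << 6) | (byte & 0x3F)
        if cp < minimum:
            raise ValueError(f"overlong encoding of U+{cp:04X} at offset {i}")
        if 0xD800 <= cp <= 0xDFFF or cp in (0xFFFE, 0xFFFF) or cp > 0x10FFFF:
            raise ValueError(f"forbidden code point U+{cp:04X} at offset {i}")
        out.append(chr(cp))
        i += length
    return "".join(out)


if __name__ == "__main__":
    print(repr(decode_strict(b"\x0a")))
    try:
        decode_strict(b"\xc0\x8a")
    except ValueError as error:
        print("rejected:", error)
```

Comparing the decoded code point against the minimum for its sequence length catches every overlong form with a single test, instead of matching the individual byte patterns listed above.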
