reosoftproductions.com
RODNEY AND ARLYN'S WEB SITE

The Concept of Code Pages

Code Pages

Definitions

A code page is a table of values that describes the character set used for encoding a particular set of glyphs, usually combined with a number of control characters.
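As a rough illustration of the "table of values" idea, the sketch below decodes one and the same byte through three code pages. It assumes Python's built-in codecs cp1252, cp437 and cp1251 as stand-ins for the corresponding Windows and DOS code pages; the helper name decode_with is made up for this example.

```python
import unicodedata

# One byte value; its meaning depends entirely on the code page used.
RAW = bytes([0xE9])


def decode_with(code_page: str, data: bytes = RAW) -> str:
    """Look the bytes up in the named code page's table."""
    return data.decode(code_page)


if __name__ == "__main__":
    for code_page in ("cp1252", "cp437", "cp1251"):
        ch = decode_with(code_page)
        # Print the code point and its Unicode name (ASCII-only output).
        print(f"{code_page}: 0xE9 -> U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The byte never changes; only the table used to interpret it does, which is why text written under one code page can look garbled when read under another.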

A character encoding represents a repertoire of characters by mapping each character to a number (a code point) and each code point to a sequence of bytes.

A glyph is an elemental symbol within an agreed set of symbols intended to represent a readable character for the purposes of writing.

A control character, or non-printing character, is a code point (a number) in a character set that does not represent a written symbol. Control characters are used as in-band signaling to cause effects other than the addition of a symbol to the text.
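To make the control-character definition concrete, here is a small sketch that detects control characters using the Unicode general category "Cc" from Python's unicodedata module. The helper names is_control and visible are hypothetical, chosen only for this example.

```python
import unicodedata


def is_control(ch: str) -> bool:
    """True if the single character ch is in Unicode category Cc (control)."""
    return unicodedata.category(ch) == "Cc"


def visible(text: str) -> str:
    """Replace each control character with a <U+XXXX> tag so it can be seen."""
    return "".join(f"<U+{ord(c):04X}>" if is_control(c) else c for c in text)


if __name__ == "__main__":
    # Tab, carriage return, line feed and bell are all in-band signals.
    print(visible("col1\tcol2\r\n\x07"))
    # Count the control characters in the 7-bit ASCII range.
    print(sum(is_control(chr(i)) for i in range(128)))
```

In the ASCII range the control characters are the code points 0 through 31 plus 127 (DEL).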

Nowadays, most locales are UTF-8 based, which means a character can take from 1 to 4 bytes (the original UTF-8 design allowed sequences of up to 6 bytes, but RFC 3629 limits them to 4). When you use text utilities on data that is meant to be treated as bytes, set LC_ALL=C. This can also improve performance noticeably, because parsing UTF-8 data has a cost.
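The variable width of UTF-8 can be seen with a few sample characters. This is a minimal sketch; the sample characters and the helper name utf8_length are chosen here for illustration. Under the current UTF-8 definition (RFC 3629), even the highest code point, U+10FFFF, fits in 4 bytes.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes the single character ch occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# (description, character) pairs chosen to hit each possible length.
SAMPLES = [
    ("Latin capital A", "\u0041"),
    ("e with acute", "\u00e9"),
    ("euro sign", "\u20ac"),
    ("grinning face emoji", "\U0001F600"),
    ("highest code point", "\U0010FFFF"),
]

if __name__ == "__main__":
    for label, ch in SAMPLES:
        print(f"U+{ord(ch):04X} {label}: {utf8_length(ch)} byte(s)")
    # Character count and byte count diverge as soon as text leaves ASCII.
    text = "na\u00efve"
    print(len(text), "characters,", len(text.encode("utf-8")), "bytes")
```

This gap between characters and bytes is the reason byte-oriented processing (as with LC_ALL=C) can skip the decoding work entirely.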

Unicode

SQL Server

Unicode is a standard for mapping code points to characters. Because it is designed to cover all the characters of all the languages of the world, there is no need for different code pages to handle different sets of characters. If you store character data that reflects multiple languages, always use Unicode data types (nchar, nvarchar, and ntext) instead of the non-Unicode data types (char, varchar, and text).
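The point about multilingual data can be sketched outside of SQL Server. The example below is not SQL Server code; it assumes Python's cp1252 and cp1251 codecs as examples of single code pages and uses UTF-16 (the encoding family behind SQL Server's Unicode types) and UTF-8 as Unicode encodings. The sample text and the helper can_encode are made up for this illustration.

```python
# German, Russian and Japanese in a single string.
TEXT = "Grüße, Привет, こんにちは"


def can_encode(text: str, encoding: str) -> bool:
    """True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    for encoding in ("cp1252", "cp1251", "utf-16-le", "utf-8"):
        status = "ok" if can_encode(TEXT, encoding) else "cannot represent all characters"
        print(f"{encoding}: {status}")
```

Each single code page covers only part of the string, while the Unicode encodings cover all of it, which mirrors the advice to prefer Unicode data types for multilingual columns.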

Significant limitations are associated with non-Unicode data types, because a non-Unicode computer is limited to a single code page. You might see a performance gain by using Unicode because fewer code-page conversions are required. Unicode collations must be selected individually at the database, column, or expression level because they are not supported at the server level.

The code pages that a client uses are determined by the operating system settings. To set client code pages on the Windows operating system, use Regional Settings in Control Panel.

Code Pages

Linux

To determine the active locale (and therefore the character encoding) on a Linux system, run:

locale

The output will look like:

LANG=en_US.UTF-8
LANGUAGE=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
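Output in this shape is easy to process in a script. The sketch below parses a shortened copy of the output above and extracts the character encoding, assuming locale names follow the common language_TERRITORY.codeset@modifier layout. The helper names parse_locale_output and codeset are hypothetical.

```python
# Shortened copy of the `locale` output shown above.
SAMPLE = """LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_ALL=en_US.UTF-8"""


def parse_locale_output(text: str) -> dict:
    """Turn NAME=value lines into a dict, dropping optional double quotes."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            settings[name.strip()] = value.strip().strip('"')
    return settings


def codeset(locale_name: str):
    """Return the codeset part of language_TERRITORY.codeset@modifier, or None."""
    base = locale_name.split("@", 1)[0]
    _, sep, name = base.partition(".")
    return name if sep else None


if __name__ == "__main__":
    settings = parse_locale_output(SAMPLE)
    print(settings["LC_CTYPE"], "->", codeset(settings["LC_CTYPE"]))
```

LC_CTYPE is the category to inspect here, since it is the one that governs character classification and encoding.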

Some applications use the following variables:

LC_ALL
LC_CTYPE
LANG

When LC_ALL is set, it overrides all of the following variables:

LC_COLLATE LC_CTYPE LC_MESSAGES LC_MONETARY LC_NUMERIC LC_TIME LANG

LANG

This variable determines the locale category for native language, local customs and coded character set in the absence of the LC_ALL and other LC_* (LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, LC_TIME) environment variables. This can be used by applications to determine the language to use for error messages and instructions, collating sequences, date formats, and so forth.
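The lookup order implied here (LC_ALL first, then the specific LC_* variable, then LANG) can be sketched as a small function. This is an illustration of the precedence rule, not the C library's actual implementation. The function name effective_locale is made up, and the fallback to "C" when nothing is set is an assumption, since the default is implementation-defined.

```python
def effective_locale(env: dict, category: str) -> str:
    """Resolve a locale category such as "LC_TIME" from environment variables.

    Precedence: LC_ALL, then the category's own variable, then LANG.
    Empty values are treated as unset.
    """
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"  # assumption: fall back to the C locale when nothing is set


if __name__ == "__main__":
    env = {"LANG": "en_US.UTF-8", "LC_TIME": "de_DE.UTF-8"}
    print(effective_locale(env, "LC_TIME"))      # the category's own variable wins over LANG
    print(effective_locale(env, "LC_MESSAGES"))  # falls through to LANG
    env["LC_ALL"] = "C"
    print(effective_locale(env, "LC_TIME"))      # LC_ALL overrides everything
```

Passing os.environ instead of a hand-built dict would resolve the categories for the current process environment.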

LANG=C

The C locale is a special locale that is meant to be the simplest one. You could say that while the other locales are for humans, the C locale is for computers. In the C locale, characters are single bytes and the sorting order is based on byte values. The charset is ASCII; this is not strictly required, but in practice it holds on the systems most of us will ever use. The language is usually US English for month names, day names, and messages from system libraries, while the language of application messages is at the discretion of the application author. Things like currency symbols are not defined.
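The byte-value sorting order can be imitated without changing the process locale. In this sketch, sorting by the encoded bytes stands in for C-locale collation, and a case-insensitive sort is used as a rough approximation of what a human-oriented locale might produce. Both helper names and the word list are made up for the example.

```python
WORDS = ["cherry", "apple", "Banana"]


def c_locale_sort(words):
    """Sort by raw byte values, as collation in the C locale does."""
    return sorted(words, key=lambda w: w.encode("utf-8"))


def case_insensitive_sort(words):
    """Rough stand-in for a language-aware collation order."""
    return sorted(words, key=str.casefold)


if __name__ == "__main__":
    print(c_locale_sort(WORDS))
    print(case_insensitive_sort(WORDS))
```

Because every uppercase ASCII letter has a smaller byte value than every lowercase one, "Banana" sorts ahead of "apple" in byte order.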