On Tue, 31 Jul 2001, Manish Jethani wrote:
BTW what are the chars with the 8th bit set (128-255) doing? Is that by any means standard? I've seen fonts having the (C) and (TM) symbols lurking somewhere in that range.
Manish J.
Before the ISO-10646 standard arrived, there were the ISO-8859-* standards. In each of these standards the range 0-127 is reserved for ASCII characters and the range 128-255 is reserved for characters of other languages. For example, the following codesets are defined:
8859-1 - Europe, Latin America (also known as Latin 1)
8859-2 - Eastern Europe
8859-5 - Cyrillic
8859-8 - Hebrew
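To make the shared range concrete, here is a minimal Python sketch (Python and its codec names such as "iso8859_5" are my choice for illustration, not something the standards prescribe). It decodes one and the same byte, 0xE0, under the four codesets listed above, and checks where the copyright sign sits in Latin 1.

```python
import unicodedata

# The four ISO-8859 parts from the list above, by their Python codec names.
CODESETS = ["iso8859_1", "iso8859_2", "iso8859_5", "iso8859_8"]

# One byte from the upper half (128-255): its meaning depends on the codeset.
raw = bytes([0xE0])
decoded = {name: raw.decode(name) for name in CODESETS}

for name, ch in decoded.items():
    print(f"{name}: 0xE0 -> U+{ord(ch):04X} {unicodedata.name(ch)}")

# The lower half (0-127) is plain ASCII in every part.
ascii_same = all(b"A".decode(name) == "A" for name in CODESETS)

# In Latin 1 the copyright sign lives in the upper half, at byte 0xA9.
copyright_byte = "\u00a9".encode("iso8859_1")
print("copyright sign in Latin 1:", hex(copyright_byte[0]))
```

The byte stays the same; only the chosen codeset decides whether it is a Latin, Cyrillic or Hebrew letter.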
This way, if a codeset has a copyright or trademark character, it falls in the extended-ASCII range (128-255). These ISO-8859-* standards are very similar to our ISCII (Indian Script Code for Information Interchange), where the range 128-255 is reserved for the various Indic scripts. When a user selects Hindi, the Hindi characters take their place in the range, and when the user selects Tamil, the Tamil characters take their place in the same range. The advantage of this scheme (ISCII as well as ISO-8859-*) is that transliteration from one script to another is possible directly. The disadvantage is that no more than two scripts can be displayed simultaneously, because the character codes overlap.
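Python ships no ISCII codec, so the following is only a rough sketch of the "same position, different script" idea, under an assumption of mine: it uses the Unicode Devanagari and Tamil blocks, whose layouts were modelled on ISCII, as a stand-in for the shared 128-255 range. The function name shift_script and the constants are hypothetical, made up for this illustration.

```python
import unicodedata

# Assumption: Unicode block starts stand in for ISCII script selection.
DEVANAGARI_BASE = 0x0900
TAMIL_BASE = 0x0B80
BLOCK_SIZE = 0x80


def shift_script(text, src_base, dst_base):
    """Move each character to the same offset in another script's block.

    Characters outside the source block, or whose target position is
    unassigned, are left unchanged.
    """
    out = []
    for ch in text:
        offset = ord(ch) - src_base
        if 0 <= offset < BLOCK_SIZE:
            candidate = chr(dst_base + offset)
            if unicodedata.name(candidate, None) is not None:
                ch = candidate
        out.append(ch)
    return "".join(out)


ka = shift_script("\u0915", DEVANAGARI_BASE, TAMIL_BASE)  # DEVANAGARI LETTER KA
ma = shift_script("\u092e", DEVANAGARI_BASE, TAMIL_BASE)  # DEVANAGARI LETTER MA
print(unicodedata.name(ka), "/", unicodedata.name(ma))
```

This is not a real transliterator: not every letter exists in every script, so some positions have no counterpart and are passed through as they are.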
The international standard ISO-10646 defines the Universal Character Set (UCS). UCS is a superset of all other character set standards. UCS contains the characters required to represent practically all known languages. Not all systems are expected to support all the advanced mechanisms of UCS such as combining characters. Therefore, ISO-10646 specifies the following three implementation levels:
Level 1 : Combining characters and Hangul Jamo characters (a special, more complicated encoding of the Korean script, where Hangul syllables are coded as two or three subcharacters) are not supported.
Level 2 : Like level 1, except that in some scripts a fixed list of combining characters is now allowed (e.g., for Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai and Lao). These scripts cannot be represented adequately in UCS without support for at least certain combining characters.
Level 3 : All UCS characters are supported, such that for example mathematicians can place a tilde or an arrow (or both) on any arbitrary character.
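To show what combining characters and Hangul Jamo look like in practice, here is a small Python sketch using the standard unicodedata module. Note the assumption: the NFC/NFD normalization forms used here come from the Unicode Standard, not from the ISO-10646 implementation levels; I use them only as a convenient way to expose the pieces.

```python
import unicodedata

COMBINING_TILDE = "\u0303"

# A combining character has a non-zero combining class.
is_combining = unicodedata.combining(COMBINING_TILDE) != 0

# "n" + tilde has a precomposed form, so NFC folds it into one code point.
nfc_n = unicodedata.normalize("NFC", "n" + COMBINING_TILDE)

# "x" + tilde has no precomposed form; it stays as two code points,
# which is the kind of arbitrary combination level 3 allows.
nfc_x = unicodedata.normalize("NFC", "x" + COMBINING_TILDE)

# Hangul syllables decompose (NFD) into two or three Jamo subcharacters.
jamo_ga = unicodedata.normalize("NFD", "\uac00")   # syllable GA
jamo_han = unicodedata.normalize("NFD", "\ud55c")  # syllable HAN

print(len(nfc_n), len(nfc_x), len(jamo_ga), len(jamo_han))
for ch in jamo_han:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

A level 1 implementation would handle the precomposed U+00F1 and the precomposed syllables, but not the tilde-on-x sequence or the separate Jamo.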
The Unicode Standard published by the Unicode Consortium contains exactly the ISO 10646-1 Basic Multilingual Plane at implementation level 3. All characters are at the same positions and have the same names in both standards.
In addition, the Unicode Standard defines much more semantics associated with some of the characters and is in general a better reference for implementors of high-quality typographic publishing systems. Unicode specifies algorithms for rendering presentation forms of some scripts (say Arabic), for handling bi-directional text that mixes, for instance, Latin and Hebrew, for sorting and string comparison, and much more.
The ISO 10646 standard on the other hand is not much more than a simple character set table, comparable to the well-known ISO 8859 standard. It specifies some terminology related to the standard, defines some encoding alternatives, and it contains specifications of how to use UCS in connection with other established ISO standards such as ISO 6429 and ISO 2022. There are other closely related ISO standards, for instance ISO 14651 on sorting UCS strings.
- Keyur