


Unicode support library

The character set used by FreeDOS-32 is ISO-10646 (the Universal Character Set, or UCS), which is practically identical to Unicode. Unicode lets FreeDOS-32 handle all the languages of the world by providing a unique mapping for each character, without entering the arena of national code pages. At present, FreeDOS-32 provides simple but effective Unicode support through the Unicode Support Library.

This document contains general information about the Unicode standard. It is neither complete nor official; rather, it is an introduction to the features provided by the FreeDOS-32 Unicode Support Library. The official Unicode documentation is available on the Unicode Consortium web site.

Overview

Standard ASCII, also known as ISO-646, maps only 128 characters with a 7-bit encoding and has proved clearly insufficient to represent even a very limited range of languages. Many extensions of ASCII have been implemented, such as a rich set of OEM code pages, the ISO-8859 character sets and multi-byte encodings like the Japanese Shift JIS.

All of these extensions (the national code pages) differ from and conflict with one another, so sharing text encoded with them around the world requires conversion procedures of varying difficulty and accuracy, which in turn require knowledge of all the code pages involved.

Unicode (or ISO-10646, which is practically the same standard) solves this problem by being a wide character set: the Unicode code space has room for more than one million characters (code points U+0000 to U+10FFFF), with no need for code page switching or escape sequences. It is a single character set that handles all the languages of the world. It totally supersedes national code pages, even though we are currently in a transition phase where they coexist with Unicode.

At the time of writing, more than 130,000 characters have already been defined, but there is room left for many more. Each character is defined by a name (not a shape, or glyph) and a unique integer number called a code point. For example, Unicode defines LATIN CAPITAL LETTER A as U+0041, where 0041h is the code point. This is somewhat similar to the ASCII code, but without its range limitation.

In order to keep some backward compatibility, the first 128 code points are equivalent to the ASCII code, and the first 256 code points are equivalent to the ISO-8859-1 code (also known as Latin-1). Next, you find Cyrillic, Greek, Arabic, Chinese, Japanese, symbols and more, all in the same character set.

The actual way code points are encoded as binary numbers for computer use is described in the Encodings section.

Encodings

In order to use Unicode on real hardware, each code point needs to be mapped to an actual string of bits. The range of code points is too large not only for 8 bits, but even for 16 bits (which was the original Unicode design): representing every code point takes 21 bits.

Three different encodings have been defined for Unicode code points: UTF-8, UTF-16 (similar to ISO's UCS-2) and UTF-32 (same as ISO's UCS-4). They are fully equivalent and interchangeable, in that each can represent every code point, but they allow implementors to represent strings as arrays of 8-bit, 16-bit or 32-bit elements respectively, according to their needs.

UTF-8

UTF-8 is a variable-length encoding based on 8-bit units. This means that a single character may be represented by one or more bytes. The original design allowed up to six bytes, but since code points stop at U+10FFFF, four bytes are the maximum in practice.

Basically, the most significant bits of each byte tell whether that byte starts a new character or continues the previous one. As a result, strings containing only the first 128 code points are encoded exactly as standard ASCII strings. This is very convenient for using the standard C char type and many of the standard C string functions, such as strcmp, strcpy and strlen (which, however, work on bytes rather than characters).

The full mapping between code points and UTF-8 is a bit more detailed than the summary above. Using binary numbers:

code point                   1st byte  2nd byte  3rd byte  4th byte
00000000 00000000 0xxxxxxx - 0xxxxxxx
00000000 00000yyy yyxxxxxx - 110yyyyy  10xxxxxx
00000000 zzzzyyyy yyxxxxxx - 1110zzzz  10yyyyyy  10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx - 11110uuu  10uuzzzz  10yyyyyy  10xxxxxx
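
The table above translates almost directly into shifts and masks. The following is a minimal sketch in C, not the actual FreeDOS-32 Unicode Support Library code: the function name utf8_encode and its signature are hypothetical, chosen only for illustration. It also rejects surrogate code points (U+D800 to U+DFFF) and values above U+10FFFF, which are not valid characters.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not the FreeDOS-32 library API).
 * Encodes one code point into out[] and returns the number of bytes
 * written, or 0 if cp cannot be encoded. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                        /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 110yyyyy 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 1110zzzz 10yyyyyy 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* beyond the Unicode range */
}
```

For instance, with this sketch U+20AC (EURO SIGN) becomes the three bytes E2h 82h ACh, while U+0041 stays the single byte 41h, exactly as in ASCII.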

UTF-16

UTF-16 is a variable-length encoding based on 16-bit units. This means that a single character may be represented by one or two 16-bit words.

UTF-16 is faster to process than UTF-8, as many code points are accessible with a single word without any processing, but it needs dedicated functions for string manipulation and produces bigger strings when ASCII characters are the most used. Because each word spans two bytes, the byte order matters: the names UTF-16LE and UTF-16BE denote little-endian and big-endian values respectively.

The mapping between code points and UTF-16, using binary numbers, is as follows:

code point                   1st word           2nd word
00000000 zzzzyyyy yyxxxxxx - zzzzyyyy yyxxxxxx
000uuuuu zzzzyyyy yyxxxxxx - 110110ww wwzzzzyy  110111yy yyxxxxxx
where wwww = uuuuu - 1.
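
The two-word case can be sketched in C as follows. This is only an illustration under stated assumptions: utf16_encode is a hypothetical name, not a function of the FreeDOS-32 library. Subtracting 10000h from the code point is the same operation as "wwww = uuuuu - 1" in the table, since it decrements the five-bit field sitting above the lower 16 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not the FreeDOS-32 library API).
 * Encodes one code point into out[] and returns the number of 16-bit
 * words written, or 0 if cp cannot be encoded. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* reserved for the two-word form */
        out[0] = (uint16_t)cp;          /* single word, value unchanged */
        return 1;
    }
    if (cp <= 0x10FFFF) {
        uint32_t v = cp - 0x10000;      /* 20 bits: wwww zzzzyyyy yyxxxxxx */
        out[0] = (uint16_t)(0xD800 | (v >> 10));    /* 110110ww wwzzzzyy */
        out[1] = (uint16_t)(0xDC00 | (v & 0x3FF));  /* 110111yy yyxxxxxx */
        return 2;
    }
    return 0;                           /* beyond the Unicode range */
}
```

With this sketch, U+1F600 is encoded as the word pair D83Dh DE00h, while any code point below 10000h (outside the reserved range) is stored as is.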

UTF-32 or UCS-4

UTF-32 (or UCS-4, using the ISO name) is a fixed-length encoding based on 32-bit units. This means that each character is represented by one 32-bit integer. Code points are mapped to 32-bit integers unchanged. This is the fastest encoding to process, but it also produces very big strings. UTF-32 seems well suited to intermediate processing. Endianness issues are present, as for UTF-16.
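
To illustrate the endianness issue, here is a small C sketch that writes a 32-bit code point to memory in an explicit byte order, independent of the byte order of the machine running it. The function names utf32_store_be and utf32_store_le are hypothetical and used only for this example.

```c
#include <stdint.h>

/* Hypothetical helpers (not the FreeDOS-32 library API). */

/* Stores cp with the most significant byte first (big endian). */
void utf32_store_be(uint32_t cp, unsigned char out[4])
{
    out[0] = (unsigned char)(cp >> 24);
    out[1] = (unsigned char)(cp >> 16);
    out[2] = (unsigned char)(cp >> 8);
    out[3] = (unsigned char)cp;
}

/* Stores cp with the least significant byte first (little endian). */
void utf32_store_le(uint32_t cp, unsigned char out[4])
{
    out[0] = (unsigned char)cp;
    out[1] = (unsigned char)(cp >> 8);
    out[2] = (unsigned char)(cp >> 16);
    out[3] = (unsigned char)(cp >> 24);
}
```

The same code point U+1F600 is stored as 00h 01h F6h 00h in big endian and as 00h F6h 01h 00h in little endian, so a reader of the data must know which order was used.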

FreeDOS-32 and Unicode

FreeDOS-32 was born to be a DOS version supporting modern technologies. Supporting several languages is crucial in today's computing; thus, supporting Unicode as the native character set from the beginning seems to be the right solution.

This also follows from guideline 1, simplicity, described in our System Specification, as it removes any code page management code from the kernel. If applications need to process strings using specific encodings, support for different code pages can be implemented with loadable modules, taking advantage of the underlying Unicode support. The NLS manager is a module for FreeDOS-32 that adds national code page support.

Several file systems (first of all FAT, the standard DOS file system) now store file names in Unicode. So the FreeDOS-32 file system support, at the very least, needs some knowledge of Unicode in order to parse these strings.

At kernel level it is very likely that FreeDOS-32 will often deal with ASCII strings. For simplicity it seems reasonable to use UTF-8 as the native encoding for strings manipulated by the FD32 kernel, in order to reuse much of the existing support for string manipulation. UTF-8 strings can be managed almost as if they were regular ASCII strings, provided that special services are used when processing a whole character is required. This is the case when you need to skip to the following character, change the case of a character, or perform a case-insensitive comparison. Finally, UTF-8 does not depend on the endianness of a system.
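
As an example of such a special service, skipping to the following character only requires looking at the two most significant bits of each byte. The sketch below is an illustration, not the library's real interface: utf8_next and utf8_length are hypothetical names, and the code assumes a well-formed, NUL-terminated UTF-8 string.

```c
#include <stddef.h>

/* Hypothetical helpers (not the FreeDOS-32 library API).
 * Both assume s is well-formed, NUL-terminated UTF-8. */

/* Returns a pointer to the first byte of the character after the one
 * starting at s, or s itself if s is at the end of the string. */
const char *utf8_next(const char *s)
{
    if (*s == '\0')
        return s;
    s++;
    /* Continuation bytes have the form 10xxxxxx: skip them. */
    while (((unsigned char)*s & 0xC0) == 0x80)
        s++;
    return s;
}

/* Counts characters (code points), unlike strlen, which counts bytes. */
size_t utf8_length(const char *s)
{
    size_t n = 0;
    while (*s != '\0') {
        s = utf8_next(s);
        n++;
    }
    return n;
}
```

For a string made of "caf" followed by the two bytes C3h A9h (U+00E9), strlen reports 5 bytes while utf8_length reports 4 characters. Plain ASCII strings give the same result with both functions.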

Unicode Support Library reference

Documentation for the FreeDOS-32 Unicode Support Library is embedded as Doxygen comments in the source code. Using the Doxygen tool, it is possible to generate on-line documentation, such as the following HTML pages.

On-line documentation