Native Language Support

ISO-10646 (or Unicode, which is practically the same) is the character set chosen as the reference for character encoding in FreeDOS-32. Being a universal character set, it solves the problem of having several language-dependent code pages. For compatibility, however, FreeDOS-32 still has to deal with them in some cases: for example, the FAT file system stores long file names in Unicode, but short names are still encoded in a national character set. The Native Language Support (NLS) manager is an optional loadable module that provides support for handling national code pages.

Code page modules and the NLS manager

This section explains code page modules and how to use them through the NLS manager.

Code page modules

There are many different code pages in the world: DOS OEM code pages (like the well-known CP437, CP850 and many others), Windows ANSI code pages (CP1252, the so-called "Windows Latin-1", and similar), ISO 8859 code pages, and so on.

Writing a single NLS module that supports all of them at once would be impractical. Like many other systems, FreeDOS-32 uses a separate loadable module for each code page. In this way, a user can load only the code page modules useful for her country or application.

Basically, each code page module knows the internals of its code page, and in particular how to translate between the code page encoding and ISO-10646/Unicode. Each code page module exposes its features through a variadic request function and a structure of operations (these are explained in the Drivers Specification).

The NLS manager

The FreeDOS-32 NLS manager provides centralized control over the loaded code page modules. It owns a list of registered code pages and provides functions (explained below) to register, unregister and query for a code page.

During initialization (in the module init function, see the Drivers Specification), code page modules register themselves with the NLS manager. After that, the kernel, processes, or other modules and drivers can query the NLS manager for a code page and use the facilities of the code page module through the NLS operations.

Multibyte character sets

The ISO-646 code, that is, the well-known ASCII code, is a 7-bit code and can therefore represent 128 different characters. Code pages usually extend it to 8 bits and use the additional 128 values to encode national characters. For most character sets, in fact, a single octet is enough to represent all national characters.

Some languages, like Chinese and Japanese, need many more than 256 symbols. For these languages, multibyte character sets have been introduced. A single character is composed of one or more bytes (usually one or two), and byte strings must be parsed accordingly.

Given this scenario, code page modules must be able to handle multibyte character sets as well as single byte ones. For this reason, unless explicitly stated otherwise, the term "multibyte" is used in the rest of this text to mean the character set defined by a national code page, whatever its width. For example, the function mbtowc, which stands for "multibyte to wide character", converts a character encoded in a national character set to its Unicode equivalent, but the character does not have to be multibyte.

Code page structure (struct nls_code_page)

Code page modules register a structure of the following type with the NLS manager. The NLS manager stores the installed (or registered) code pages in a linked list. The same code page may have more than one code page structure registered in the linked list, to provide aliases for code page names, if needed. Each code page module, however, can only manage one code page, so a parameter like the C++ this pointer is not needed when calling the request function or the NLS operations.

struct nls_code_page
{
    const char *name;
    int (*request)(int function, ...);
    struct nls_code_page *next;
};

Where:

  • name is a string encoded in Unicode UTF-8 that uniquely identifies the code page. Name comparison is performed disregarding case for characters in ASCII range (that is, 'a'..'z'), so that the simple implementation of strcasecmp can be used, without knowledge of the UTF-8 encoding;
  • request is the address of the request function of the code page driver. See the Drivers Specification for more information about the request function;
  • next is a pointer to the next code page structure in the linked list owned by the NLS manager. It must be initialized to a null pointer when registering a new code page, and ignored as reserved afterwards.
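The name comparison rule described above can be sketched as follows. This is only an illustration of why an ASCII-only comparison is safe on UTF-8 names: every byte of a multi-byte UTF-8 sequence is 0x80 or greater, so folding only 'A'..'Z' never touches it. The names ascii_fold and ascii_strcasecmp are hypothetical and are not part of the NLS manager interface.

```c
/* Folds an ASCII upper case letter to lower case. Any other byte,
 * including every byte of a multi-byte UTF-8 sequence (all >= 0x80),
 * is returned unchanged. */
static unsigned char ascii_fold(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char) (c - 'A' + 'a') : c;
}

/* Hypothetical helper: byte-wise comparison ignoring ASCII case only.
 * Returns zero when the two names match, non-zero otherwise. */
int ascii_strcasecmp(const char *a, const char *b)
{
    for (;;)
    {
        unsigned char ca = ascii_fold((unsigned char) *a++);
        unsigned char cb = ascii_fold((unsigned char) *b++);
        if (ca != cb) return (int) ca - (int) cb;
        if (ca == '\0') return 0;
    }
}
```

With such a comparison, "cp437" and "CP437" identify the same code page, while two names that differ only in the case of a non-ASCII letter are treated as different.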

Code page drivers should provide statically allocated structures for struct nls_code_page and statically allocated strings for name in order to reduce memory fragmentation.
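As an illustration, a code page module might declare its structures like this. The sketch assumes a CP437 module that also wants to be reachable under the alias "IBM437"; the names cp437_request, cp437 and cp437_alias are hypothetical, and the request function is a placeholder that implements no command.

```c
#include <errno.h>
#include <stddef.h>

/* Structure as given in the text above. */
struct nls_code_page
{
    const char *name;
    int (*request)(int function, ...);
    struct nls_code_page *next;
};

/* Hypothetical request function. A real module would dispatch on
 * `function` as described in the Drivers Specification; this
 * placeholder rejects every command. */
int cp437_request(int function, ...)
{
    (void) function;
    return -ENOTSUP;
}

/* Statically allocated structures and name strings. Both entries share
 * the same request function, so the second one acts as an alias.
 * The next field starts as a null pointer, as required for registration. */
static struct nls_code_page cp437       = { "CP437",  cp437_request, NULL };
static struct nls_code_page cp437_alias = { "IBM437", cp437_request, NULL };
```

Because both objects have static storage duration, registering them requires no dynamic allocation.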

NLS operations

Code page drivers expose their functionality through a structure of operations called struct nls_operations. See the Drivers Specification for more information about structures of operations.

struct nls_operations
{
    int (*mbtowc) (wchar_t *restrict result,
                   const char *restrict string,
                   size_t size);
    int (*wctomb) (char *string, wchar_t wchar, size_t size);
    int (*mblen)  (const char *string, size_t size);
    int (*toupper)(int ch);
    int (*tolower)(int ch);
    int (*release)(void);
};

Method: mbtowc

Multibyte to wide character.

int mbtowc (wchar_t *restrict result,
            const char *restrict string,
            size_t size)

Description

This is similar to the standard libc function mbtowc. It converts a single character (possibly multibyte) at the beginning of the source string, encoded in the national code page, to a wide character (UCS-4 or UTF-32) and stores it in result. It examines no more than size bytes of the source string. The objects pointed to by result and string should not overlap.

Return value

On success, the wide character is stored in result and the length in bytes of the processed multibyte character is returned.

On error, the content of result is undefined and one of the following negative codes is returned:
-EINVAL you have passed a null pointer as a parameter;
-ENAMETOOLONG size is too small to parse the multibyte character.
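A minimal sketch of this method for a single byte code page is shown below. It assumes ISO 8859-1, chosen only because its 256 byte values coincide with the first 256 Unicode code points, so no translation table is needed. The function name latin1_mbtowc is hypothetical, and the error codes are assumed to come from a POSIX-style errno.h.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical mbtowc method for ISO 8859-1: every character is one
 * byte long and maps to the Unicode code point of the same value. */
int latin1_mbtowc(wchar_t *restrict result,
                  const char *restrict string,
                  size_t size)
{
    if (result == NULL || string == NULL) return -EINVAL;
    if (size < 1) return -ENAMETOOLONG; /* not even one byte to examine */
    /* Go through unsigned char so that bytes >= 0x80 are not sign-extended. */
    *result = (wchar_t) (unsigned char) string[0];
    return 1; /* one byte consumed */
}
```

A code page whose upper half does not match Unicode would replace the direct assignment with a table lookup.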

Method: wctomb

Wide character to multibyte.

int wctomb (char *string, wchar_t wchar, size_t size)

Description

This is similar to the standard libc function wctomb. It converts a wide character (UCS-4 or UTF-32) to a single character (possibly multibyte) encoded in the national code page, storing the result at the beginning of string. Unlike its libc counterpart, it does so only if the converted character fits in no more than size bytes.

Return value

On success, the multibyte character is stored in string and its length in bytes is returned.

On error, the content of string is undefined and one of the following negative codes is returned:
-EINVAL you have passed a null pointer as a parameter;
-EILSEQ the wide character cannot be encoded in this code page;
-ENAMETOOLONG size is too small to fit the converted multibyte character.
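The reverse conversion can be sketched in the same way. The example again assumes ISO 8859-1, where only the code points U+0000 to U+00FF can be encoded. The function name latin1_wctomb is hypothetical, and the order in which the error conditions are checked is a choice of this sketch, not something the text specifies.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical wctomb method for ISO 8859-1. */
int latin1_wctomb(char *string, wchar_t wchar, size_t size)
{
    if (string == NULL) return -EINVAL;
    /* Converting to unsigned long makes the range check work whether
     * wchar_t is a signed or an unsigned type. */
    if ((unsigned long) wchar > 0xFFUL) return -EILSEQ;
    if (size < 1) return -ENAMETOOLONG; /* no room for the single byte */
    string[0] = (char) (unsigned char) wchar;
    return 1; /* one byte stored */
}
```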

Method: mblen

Gets the length of a multibyte character.

int mblen (const char *string, size_t size)

Description

This is similar to the standard libc function mblen. It gets the length in bytes of a single character (possibly multibyte) at the beginning of the source string, encoded in the national code page, examining no more than size bytes of the source string. It performs a subset of the work performed by the mbtowc method.

Return value

On success, the length in bytes of the processed multibyte character is returned.

On error, one of the following negative codes is returned:
-EINVAL you have passed a null pointer as a parameter;
-ENAMETOOLONG size is too small to parse the multibyte character.
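For a double byte code page, this method has to recognize lead bytes. The sketch below assumes the lead byte ranges commonly given for Shift_JIS (0x81..0x9F and 0xE0..0xFC); they are used here only as an example and should be checked against the definition of the actual code page. The names is_lead_byte and dbcs_mblen are hypothetical, and trail bytes are not validated.

```c
#include <errno.h>
#include <stddef.h>

/* Assumed lead byte ranges (Shift_JIS-like), for illustration only. */
static int is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Hypothetical mblen method for a code page whose characters are
 * either one or two bytes long. */
int dbcs_mblen(const char *string, size_t size)
{
    if (string == NULL) return -EINVAL;
    if (size < 1) return -ENAMETOOLONG;
    if (!is_lead_byte((unsigned char) string[0])) return 1;
    /* A lead byte needs a trail byte within the examined range. */
    if (size < 2) return -ENAMETOOLONG;
    return 2;
}
```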

Remarks

You can use this function to find out whether a character is single byte or multibyte. For example, you must make sure a character is single byte before calling the toupper and tolower methods.

Method: toupper

Converts a single byte character to upper case.

int toupper (int ch)

Description

If the single byte character ch, interpreted as an unsigned char, is a lower case letter, this function converts it to the corresponding upper case letter. If ch is a byte that is part of a multibyte character, the result is undefined. You can use the mblen method to find out whether a character is single byte or multibyte.

Return value

A byte corresponding to the upper case version of ch, if available, otherwise ch is returned unchanged.

Method: tolower

Converts a single byte character to lower case.

int tolower (int ch)

Description

This is identical to the toupper method, but it converts ch to lower case. Please refer to toupper for the details.
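The interplay between mblen and the case conversion methods can be sketched as follows: the buffer is walked one character at a time, and toupper is applied only to characters that mblen reports as single byte. The helper nls_upcase_buffer and the demo_* operations are hypothetical. The demo code page (ASCII letters plus lead bytes 0x81..0x9F) exists only to make the sketch self-contained, and advancing by one byte when mblen returns zero is a defensive choice of this sketch.

```c
#include <errno.h>
#include <stddef.h>

/* Structure of operations as given in the text above. */
struct nls_operations
{
    int (*mbtowc) (wchar_t *restrict result,
                   const char *restrict string,
                   size_t size);
    int (*wctomb) (char *string, wchar_t wchar, size_t size);
    int (*mblen)  (const char *string, size_t size);
    int (*toupper)(int ch);
    int (*tolower)(int ch);
    int (*release)(void);
};

/* Upper-cases in place the single byte characters of buf, skipping
 * multibyte ones. Returns zero, or the negative code from mblen. */
int nls_upcase_buffer(const struct nls_operations *ops, char *buf, size_t size)
{
    size_t pos = 0;
    while (pos < size)
    {
        int len = ops->mblen(buf + pos, size - pos);
        if (len < 0) return len;
        if (len == 0) len = 1; /* defensive: always make progress */
        if (len == 1)
        {
            /* Parenthesized so that a libc toupper macro, if any,
             * is not expanded here. */
            buf[pos] = (char) (ops->toupper)((unsigned char) buf[pos]);
        }
        pos += (size_t) len;
    }
    return 0;
}

/* Hypothetical demo operations: two-byte characters start with a byte
 * in 0x81..0x9F, and only ASCII letters have case. */
static int demo_mblen(const char *string, size_t size)
{
    unsigned char c;
    if (string == NULL) return -EINVAL;
    if (size < 1) return -ENAMETOOLONG;
    c = (unsigned char) string[0];
    if (c < 0x81 || c > 0x9F) return 1;
    return size < 2 ? -ENAMETOOLONG : 2;
}

static int demo_toupper(int ch)
{
    return (ch >= 'a' && ch <= 'z') ? ch - 'a' + 'A' : ch;
}

static const struct nls_operations demo_ops =
{
    .mblen   = demo_mblen,
    .toupper = demo_toupper
};
```

Note that the trail byte of a two-byte character may have the same value as an ASCII letter; skipping it by its reported length is what keeps it from being converted by mistake.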

NLS manager functions

The NLS manager stores the installed (or registered) code pages in a linked list, and provides a set of functions to interact with it. These functions are registered in the symbol table of the FD32 kernel, so calling them results in a near call when the caller is dynamically linked with the kernel. This also means that, in order to use the functions of the NLS manager, applications or other modules must be loaded after the NLS manager module.

Function: nls_get_operations

Gets the structure of operations for a code page.

int nls_get_operations (const char *name, int type, void **operations)

Description

Gets the specified type of structure of operations for the code page identified by name, storing a pointer to that structure in operations.

The type parameter is intended for future enhancements to the NLS operations interface. At present only OT_NLS_OPERATIONS is valid. This function uses the REQ_GET_OPERATIONS command of the request function of the code page. The operations argument can be a null pointer if you only need to check whether the specified type of operations is available.

The code page name must match the name of a registered code page. Name comparison is performed by the simple strcasecmp, thus it is case insensitive for characters in the ASCII range (that is, 'a'..'z').

Return value

On success, returns zero and stores the pointer to the structure of operations in operations.

On error, returns one of the following and the content of operations is undefined:
-EINVAL you have passed a null pointer for name;
-ENOTSUP the specified operations type is not supported by the code page module; this can also mean that the underlying request function does not support the REQ_GET_OPERATIONS command, but This Should Not Happen.
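From the point of view of a client, using this function might look like the sketch below: get the operations, convert a buffer with mbtowc, then release the operations. To keep the example self-contained, nls_get_operations is replaced by a stand-in that accepts any non-null name and always hands out ISO 8859-1 operations. The value of OT_NLS_OPERATIONS, the helper nls_to_wide and the latin1_* functions are all assumptions of this sketch.

```c
#include <errno.h>
#include <stddef.h>

#define OT_NLS_OPERATIONS 1 /* hypothetical value; the real one comes from the FD32 headers */

struct nls_operations
{
    int (*mbtowc) (wchar_t *restrict result,
                   const char *restrict string,
                   size_t size);
    int (*wctomb) (char *string, wchar_t wchar, size_t size);
    int (*mblen)  (const char *string, size_t size);
    int (*toupper)(int ch);
    int (*tolower)(int ch);
    int (*release)(void);
};

/* Stand-in operations (ISO 8859-1), only mbtowc and release are filled in. */
static int latin1_mbtowc(wchar_t *restrict result,
                         const char *restrict string,
                         size_t size)
{
    if (result == NULL || string == NULL) return -EINVAL;
    if (size < 1) return -ENAMETOOLONG;
    *result = (wchar_t) (unsigned char) string[0];
    return 1;
}

static int latin1_release(void) { return 0; }

static struct nls_operations latin1_ops =
{
    .mbtowc  = latin1_mbtowc,
    .release = latin1_release
};

/* Stand-in for the NLS manager function. A real manager would search
 * its list of registered code pages by name; this stub does not. */
int nls_get_operations(const char *name, int type, void **operations)
{
    if (name == NULL) return -EINVAL;
    if (type != OT_NLS_OPERATIONS) return -ENOTSUP;
    if (operations != NULL) *operations = &latin1_ops;
    return 0;
}

/* Hypothetical client helper: converts the size bytes at src to at most
 * max wide characters in dst. Returns the number of wide characters
 * stored, or a negative code. */
int nls_to_wide(const char *cp_name, const char *src, size_t size,
                wchar_t *dst, size_t max)
{
    void *p = NULL;
    struct nls_operations *ops;
    size_t in = 0, out = 0;
    int res = nls_get_operations(cp_name, OT_NLS_OPERATIONS, &p);
    if (res < 0) return res;
    ops = p;
    while (in < size && out < max)
    {
        res = ops->mbtowc(&dst[out], src + in, size - in);
        if (res <= 0) break; /* error, or no progress possible */
        in += (size_t) res;
        out++;
    }
    ops->release(); /* drop the reference so the code page can be unregistered */
    return res < 0 ? res : (int) out;
}
```

Releasing the operations on every path, including the error path, matters because a code page that is still referenced cannot be unregistered.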

Function: nls_register_codepage

Registers a code page with the NLS manager.

int nls_register_codepage (struct nls_code_page *cp)

Description

Registers the specified code page structure, linking it at the head (for performance reasons) of the list of code pages. The next field of the code page structure is updated accordingly. After the code page is registered, users can get its NLS operations using nls_get_operations.

The code page structure must be properly initialized: name must contain a valid pointer to a name string, request must contain the valid address of the request function of the code page module and next must be a null pointer.

Return value

Zero on success. -EBUSY if the code page structure is already linked (the next field is not null). -EINVAL if cp is a null pointer or the code page structure is not valid (either the name field or the request function pointer is a null pointer).
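The documented behaviour can be sketched as a plain head insertion into a singly linked list. This is not the actual FreeDOS-32 implementation: the names sketch_register_codepage, cp_list and demo_request are hypothetical, and only the checks listed above are performed.

```c
#include <errno.h>
#include <stddef.h>

struct nls_code_page
{
    const char *name;
    int (*request)(int function, ...);
    struct nls_code_page *next;
};

/* Hypothetical head of the list of registered code pages. */
static struct nls_code_page *cp_list = NULL;

/* Hypothetical request function, used only to fill in the structure. */
int demo_request(int function, ...)
{
    (void) function;
    return -ENOTSUP;
}

/* Sketch of the registration logic described above. */
int sketch_register_codepage(struct nls_code_page *cp)
{
    if (cp == NULL || cp->name == NULL || cp->request == NULL)
        return -EINVAL;
    /* A non-null next field means the structure is already linked.
     * Note that the node at the tail of the list also has a null next
     * field, so this check alone cannot detect a second registration
     * of that node. */
    if (cp->next != NULL)
        return -EBUSY;
    cp->next = cp_list; /* link at the head: no list traversal needed */
    cp_list = cp;
    return 0;
}
```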

Function: nls_unregister_codepage

Unregisters a code page from the NLS manager.

int nls_unregister_codepage (struct nls_code_page *cp)

Description

Unregisters the specified code page structure, unlinking it from the list of code pages. After unregistering, the next field of the structure is a null pointer and the code page cannot be used any longer. If there are still references to the code page, that is, some user has not released its structure of operations, the code page structure cannot be unregistered.

Return value

Zero on success. -EBUSY if the code page structure is in use (its reference counter is greater than zero). -EINVAL if cp is not the address of a registered code page structure (this includes the case when cp is a null pointer).
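Unlinking can be sketched with a pointer-to-pointer walk over the list. As with the other examples, this is only an illustration under stated assumptions: sketch_unregister_codepage and cp_list are hypothetical names, and the reference count check is left out because the text does not say where the counter is kept.

```c
#include <errno.h>
#include <stddef.h>

struct nls_code_page
{
    const char *name;
    int (*request)(int function, ...);
    struct nls_code_page *next;
};

/* Hypothetical head of the list of registered code pages. */
static struct nls_code_page *cp_list = NULL;

/* Sketch of the unlinking logic described above. */
int sketch_unregister_codepage(struct nls_code_page *cp)
{
    struct nls_code_page **link;
    if (cp == NULL) return -EINVAL;
    /* link always points at the pointer that leads to the current node,
     * so the head and the inner nodes are handled by the same code. */
    for (link = &cp_list; *link != NULL; link = &(*link)->next)
    {
        if (*link == cp)
        {
            /* A real manager would return -EBUSY here while the
             * operations of this code page are still referenced. */
            *link = cp->next;
            cp->next = NULL; /* ready to be registered again */
            return 0;
        }
    }
    return -EINVAL; /* cp is not a registered code page structure */
}
```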