Native Language Support
Last update: 2004-10-04
Copyright © 2004 Salvatore Isaja
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts. A copy of the license is included by reference.
ISO-10646 (or Unicode, which is practically the same) is the character set chosen as the reference for character encoding in FreeDOS-32. Being a universal character set, it solves the problem of having several language-dependent code pages. For compatibility, however, FreeDOS-32 still has to deal with them in some cases: for example, the FAT file system stores long file names in Unicode, but short names are still encoded in a national character set. The Native Language Support (NLS) manager is an optional loadable module which provides support for handling national code pages.
Code page modules and the NLS manager
This section explains code page modules and how to use them through the NLS manager.
Code page modules
There are many different code pages in the world: DOS OEM code pages (like the famous CP437, CP850 and many others), Windows ANSI code pages (CP1252, the so-called "Windows Latin-1", and similar), ISO 8859 code pages and so on.
It is not convenient to write a single NLS module providing support for all of them at once. FreeDOS-32 does what many other systems do: it provides a separate loadable module for each code page. In this way, a user can load only the code page modules useful for her country or application.
Basically, each code page module knows the internals of its code page, and in particular how to translate between the code page encoding and ISO-10646/Unicode. Each code page module exposes its features through a variadic request function and a structure of operations (these are explained in the Drivers Specification).
The NLS manager
The FreeDOS-32 NLS manager provides centralized control over the loaded code page modules. The NLS manager owns a list of registered code pages and provides functions (explained below) to register, unregister and query for a code page.
During initialization (in the module init function, see the Drivers Specification), code page modules register themselves in the NLS manager list. After that, the kernel, processes or other modules and drivers can query the NLS manager for a code page and use the facilities of the code page module through NLS operations.
Multibyte character sets
The ISO-646 code, that is, the famous ASCII code, is a 7-bit code, thus it can represent 128 different characters. What code pages usually do is extend the code to 8 bits and use the 128 new values to encode national characters. For most character sets, in fact, a single octet is enough to represent all national characters.
There are some languages, like Chinese and Japanese, that need many more than 256 symbols. For these languages, multibyte character sets have been introduced. A single character is composed of one or more bytes (usually one or two), and byte strings must be parsed accordingly.
This being the scenario, code page modules must be able to handle multibyte
character sets as well as single byte character sets. For this reason, unless explicitly
stated otherwise, in the rest of this text the term "multibyte" is used to mean
the character set defined by a national code page. For example, the function mbtowc,
which means "multibyte to wide character", converts a character encoded in a national
character set to its Unicode equivalent, but the character does not have to be multibyte.
Code page structure (struct nls_code_page)
Code page modules register a structure of the following type with the NLS manager.
The NLS manager stores the installed (or registered) code pages in a linked list.
The same code page may have more than one code page structure registered in the
linked list, to provide aliases for code page names, if needed. Each code page
module, however, can only manage one code page, thus a parameter like the
C++ this is not needed when calling the request function or the
NLS operations.
struct nls_code_page
{
    const char *name;
    int (*request)(int function, ...);
    struct nls_code_page *next;
};
Where:
name is a string encoded in Unicode UTF-8 that uniquely identifies the code page. Name comparison is performed disregarding case for characters in the ASCII range (that is, 'a'..'z'), so that a simple implementation of strcasecmp can be used, without knowledge of the UTF-8 encoding;
request is the address of the request function of the code page driver. See the Drivers Specification for more information about the request function;
next is a pointer to the next code page structure in the linked list owned by the NLS manager. It must be initialized to a null pointer when registering a new code page, and treated as reserved (ignored) afterwards.
Code page drivers should provide statically allocated structures for struct
nls_code_page and statically allocated strings for name in order
to reduce memory fragmentation.
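As an illustration of the static allocation recommended above, a code page module might declare its structure as in the following minimal sketch. The names cp437_request and cp437_page are hypothetical, and the request function is only a placeholder: the real command handling is described in the Drivers Specification. Returning -ENOTSUP for every command is an assumption of this sketch.

```c
#include <errno.h>
#include <stddef.h>

/* Structure layout copied from the text above. */
struct nls_code_page
{
    const char *name;
    int (*request)(int function, ...);
    struct nls_code_page *next;
};

/* Hypothetical request function. A real one would dispatch on `function`
   as described in the Drivers Specification; this placeholder rejects
   every command (the error code is an assumption). */
static int cp437_request(int function, ...)
{
    (void) function;
    return -ENOTSUP;
}

/* Statically allocated structure and name string: no heap allocation. */
static struct nls_code_page cp437_page =
{
    .name    = "cp437",
    .request = cp437_request,
    .next    = NULL /* must be null when the structure is registered */
};
```

The module init function would then pass the address of cp437_page to nls_register_codepage.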
NLS operations
Code page drivers expose their functionality through a structure
of operations called struct nls_operations. See the
Drivers Specification for more
information about structures of operations.
struct nls_operations
{
    int (*mbtowc) (wchar_t *restrict result,
                   const char *restrict string,
                   size_t size);
    int (*wctomb) (char *string, wchar_t wchar, size_t size);
    int (*mblen)  (const char *string, size_t size);
    int (*toupper)(int ch);
    int (*tolower)(int ch);
    int (*release)(void);
};
Method: mbtowc
Multibyte to wide character.
int mbtowc (wchar_t *restrict result,
const char *restrict string,
size_t size)
Description
This is similar to the standard libc function mbtowc. It converts
a single character (possibly multibyte) at the beginning of the source string,
encoded in the national code page, to a wide character (UCS-4 or UTF-32) and
stores the result in result. It examines no more than size bytes
of the source string. The objects pointed to by result and string should
not overlap.
Return value
On success, the wide character is stored in result and the length in bytes of the processed multibyte character is returned.
On error, the content of result is undefined and one of the following
negative codes is returned:
-EINVAL you have passed a null pointer as a parameter;
-ENAMETOOLONG size is too small to parse the multibyte character.
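A minimal sketch of how a single byte code page module might implement this method, assuming ISO 8859-1 (Latin-1), whose byte values coincide with the first 256 Unicode code points. The name latin1_mbtowc is hypothetical; the error handling follows the return values listed above.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical mbtowc method for ISO 8859-1: every character is one byte
   and its value equals the Unicode code point. */
static int latin1_mbtowc(wchar_t *restrict result,
                         const char *restrict string,
                         size_t size)
{
    if (result == NULL || string == NULL)
        return -EINVAL;          /* null pointer passed as a parameter */
    if (size < 1)
        return -ENAMETOOLONG;    /* not enough bytes to parse a character */
    /* Go through unsigned char so that bytes 0x80..0xFF are not
       sign-extended where plain char is signed. */
    *result = (wchar_t) (unsigned char) string[0];
    return 1;                    /* one byte processed */
}
```

A code page with a real translation table would replace the cast with a table lookup indexed by the byte value.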
Method: wctomb
Wide character to multibyte.
int wctomb (char *string, wchar_t wchar, size_t size)
Description
This is similar to the standard libc function wctomb. It converts a wide character
(UCS-4 or UTF-32) to a single character (possibly multibyte) encoded in the
national code page, storing the result at the beginning of string. Unlike
its libc counterpart, it does so only if the converted character fits in no more
than size bytes.
Return value
On success, the multibyte character is stored in string and its length in bytes is returned.
On error, the content of string is undefined and one of the following
negative codes is returned:
-EINVAL you have passed a null pointer as a parameter;
-EILSEQ the wide character cannot be encoded in this code page;
-ENAMETOOLONG size is too small to fit the converted multibyte character.
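The reverse direction can be sketched in the same way, again assuming ISO 8859-1 so that no translation table is needed. The name latin1_wctomb is hypothetical, and the order in which the error conditions are checked is a choice of this sketch, not something the text above prescribes.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical wctomb method for ISO 8859-1: only code points
   U+0000..U+00FF can be encoded, each as a single byte. */
static int latin1_wctomb(char *string, wchar_t wchar, size_t size)
{
    if (string == NULL)
        return -EINVAL;          /* null pointer passed as a parameter */
    /* The cast makes the range check work whether wchar_t is signed
       or unsigned on the target. */
    if ((unsigned long) wchar > 0xFFul)
        return -EILSEQ;          /* not representable in this code page */
    if (size < 1)
        return -ENAMETOOLONG;    /* no room for the converted character */
    string[0] = (char) (unsigned char) wchar;
    return 1;                    /* one byte stored */
}
```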
Method: mblen
Gets the length of a multibyte character.
int mblen (const char *string, size_t size)
Description
This is similar to the standard libc function mblen. It gets the
length in bytes of a single character (possibly multibyte)
at the beginning of the source string, encoded in the national code page,
examining no more than size bytes of the source string. It performs
a subset of the work performed by the mbtowc method.
Return value
On success, the length in bytes of the processed multibyte character is returned.
On error, one of the following
negative codes is returned:
-EINVAL you have passed a null pointer as a parameter;
-ENAMETOOLONG size is too small to parse the multibyte character.
Remarks
You can use this function to find out whether a character is single byte or multibyte.
For example, you must be sure a character is single byte before calling the
toupper and tolower methods.
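The following sketch shows one way to combine mblen and toupper when converting a whole buffer to upper case. The helper upcase_buffer is hypothetical, and the "toy" operations exist only to exercise it: they describe an invented encoding in which bytes 0x80..0xFF start a two byte character, not a real code page. Treating a zero length as an error is also an assumption, since the text does not define that case.

```c
#include <errno.h>
#include <stddef.h>

/* Structure of operations, copied from the text above. */
struct nls_operations
{
    int (*mbtowc) (wchar_t *restrict result,
                   const char *restrict string,
                   size_t size);
    int (*wctomb) (char *string, wchar_t wchar, size_t size);
    int (*mblen)  (const char *string, size_t size);
    int (*toupper)(int ch);
    int (*tolower)(int ch);
    int (*release)(void);
};

/* Hypothetical helper: upper-cases the single byte characters of a buffer
   and skips multibyte characters. Returns zero or a negative error code.
   The calls are parenthesized so that a libc macro with the same name as
   a method cannot interfere. */
static int upcase_buffer(const struct nls_operations *ops,
                         char *buffer, size_t size)
{
    size_t pos = 0;
    while (pos < size)
    {
        int len = (ops->mblen)(buffer + pos, size - pos);
        if (len < 0)
            return len;       /* propagate the error code */
        if (len == 0)
            return -EINVAL;   /* assumption: guarantees forward progress */
        if (len == 1)
            buffer[pos] = (char) (ops->toupper)((unsigned char) buffer[pos]);
        pos += (size_t) len;
    }
    return 0;
}

/* Toy methods, NOT a real code page: 0x80..0xFF lead a two byte character. */
static int toy_mblen(const char *string, size_t size)
{
    if (string == NULL)
        return -EINVAL;
    if (size < 1)
        return -ENAMETOOLONG;
    if ((unsigned char) string[0] < 0x80)
        return 1;
    return size >= 2 ? 2 : -ENAMETOOLONG;
}

static int toy_toupper(int ch)
{
    return (ch >= 'a' && ch <= 'z') ? ch - 'a' + 'A' : ch;
}

/* Only the two methods needed by the helper are filled in. */
static const struct nls_operations toy_operations =
{
    .mblen   = toy_mblen,
    .toupper = toy_toupper
};
```

Note that the trail byte of a multibyte character is never passed to toupper, even when its value happens to look like a lower case ASCII letter.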
Method: toupper
Converts a single byte character to upper case.
int toupper (int ch)
Description
If the single byte character ch, interpreted as an unsigned
char, is a lower case letter, this function converts it to its corresponding
upper case letter. If ch is a byte that is part of a multibyte character,
the result is undefined. You can use the mblen method to find out whether a character
is single byte or multibyte.
Return value
A byte corresponding to the upper case version of ch, if available, otherwise ch is returned unchanged.
Method: tolower
Converts a single byte character to lower case.
int tolower (int ch)
Description
This is identical to the toupper method, but it converts ch to
lower case. Please refer to toupper for the details.
NLS manager functions
The NLS manager stores the installed (or registered) code pages in a linked list. To interact with this linked list a set of functions is provided. These functions are registered in the symbol table of the FD32 kernel, and thus calling them results in a near call when the caller is dynamically linked with the kernel. This also means that, in order to use the functions of the NLS manager, applications or other modules must be loaded after the NLS manager module.
Function: nls_get_operations
Gets the structure of operations for a code page.
int nls_get_operations (const char *name, int type, void **operations)
Description
Gets the specified type of structure of operations for the code page identified by name, storing a pointer to that structure in operations.
The type parameter
is intended for future enhancements to the NLS operations interface. At present
only OT_NLS_OPERATIONS is valid. This function
uses the REQ_GET_OPERATIONS command of the request function of the code
page. The operations argument can be a null pointer if you only need to
check whether the specified type of operations is available.
The code page name must match the name of a registered code page. Name
comparison is performed by the simple strcasecmp, thus it is case
insensitive for characters in the ASCII range (that is, 'a'..'z').
Return value
On success, returns zero and stores the pointer to the structure of operations in operations.
On error, returns one of the following and the content of operations is
undefined:
-EINVAL you have passed a null pointer for name;
-ENOTSUP the specified operations type is not supported by the code page
module; this can also mean that the underlying request function does not support the
REQ_GET_OPERATIONS command, but This Should Not Happen.
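The lookup by name described above can be pictured as follows. This is only a sketch: the actual NLS manager code is not shown in this document, and ascii_casecmp and find_code_page are hypothetical names. The comparison folds case only for ASCII letters, which is why it needs no knowledge of the UTF-8 encoding.

```c
#include <stddef.h>

/* Structure layout copied from the text above. */
struct nls_code_page
{
    const char *name;
    int (*request)(int function, ...);
    struct nls_code_page *next;
};

/* Case-insensitive comparison for ASCII letters only. Bytes outside
   'A'..'Z' (including all UTF-8 multibyte sequences) compare as they are. */
static int ascii_casecmp(const char *a, const char *b)
{
    for (;; a++, b++)
    {
        unsigned char ca = (unsigned char) *a;
        unsigned char cb = (unsigned char) *b;
        if (ca >= 'A' && ca <= 'Z') ca = (unsigned char) (ca - 'A' + 'a');
        if (cb >= 'A' && cb <= 'Z') cb = (unsigned char) (cb - 'A' + 'a');
        if (ca != cb)
            return (int) ca - (int) cb;
        if (ca == 0)
            return 0;   /* both strings ended at the same position */
    }
}

/* Walks the linked list and returns the first structure whose name
   matches, or a null pointer if no code page has that name. */
static struct nls_code_page *find_code_page(struct nls_code_page *head,
                                            const char *name)
{
    for (struct nls_code_page *cp = head; cp != NULL; cp = cp->next)
        if (ascii_casecmp(cp->name, name) == 0)
            return cp;
    return NULL;
}
```

Once the structure is found, the manager would invoke its request function with the REQ_GET_OPERATIONS command; that step depends on the Drivers Specification and is left out here.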
Function: nls_register_codepage
Registers a code page to the NLS manager.
int nls_register_codepage (struct nls_code_page *cp)
Description
Registers the specified code page structure linking it at the head (for performance
reasons) of the list of code pages. The next field of the code page structure
is updated accordingly. After registering the code page it is possible for users
to get NLS operations using nls_get_operations.
The code page structure must be properly initialized: name must
contain a valid pointer to a name string, request must contain the
valid address of the request function of the code page module and next must
be a null pointer.
Return value
Zero on success. -EBUSY if the code page structure is already
linked (the next field is not null). -EINVAL if the code page structure
is not valid (either the name field or the request function pointer is a null
pointer) or cp is a null pointer.
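The checks and the head insertion described above might look like the following sketch. It is not the NLS manager source: sketch_register_codepage, stub_request and the list head code_pages are hypothetical names, and checking validity before the linked state is an assumption about ordering.

```c
#include <errno.h>
#include <stddef.h>

/* Structure layout copied from the text above. */
struct nls_code_page
{
    const char *name;
    int (*request)(int function, ...);
    struct nls_code_page *next;
};

/* Hypothetical head of the list owned by the manager. */
static struct nls_code_page *code_pages = NULL;

/* Placeholder request function so that test structures are valid. */
static int stub_request(int function, ...)
{
    (void) function;
    return -ENOTSUP;
}

/* Sketch of the registration logic. */
static int sketch_register_codepage(struct nls_code_page *cp)
{
    if (cp == NULL || cp->name == NULL || cp->request == NULL)
        return -EINVAL;   /* invalid structure or null pointer */
    if (cp->next != NULL)
        return -EBUSY;    /* already linked in a list */
    /* Link at the head: constant time, no list traversal. */
    cp->next = code_pages;
    code_pages = cp;
    return 0;
}
```

One consequence of using the next field as the "already linked" test is that the last element of the list, whose next field is null, cannot be told apart from an unregistered structure by this check alone.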
Function: nls_unregister_codepage
Unregisters a code page from the NLS manager.
int nls_unregister_codepage (struct nls_code_page *cp)
Description
Unregisters the specified code page structure, unlinking it from the list
of code pages. After unregistering the code page structure, its next field
is a null pointer and the code page cannot be used any longer. If there are
references to the code page, that is, some user has not released the structure
of operations of the code page, the code page structure cannot be unregistered.
Return value
Zero on success. -EBUSY if the code page structure is in use
(its reference counter is greater than zero). -EINVAL if
cp is not the address of a registered code page structure (this includes
the case when cp is a null pointer).
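The unlinking step can be sketched as below. The function name sketch_unregister_codepage and the explicit head parameter are hypothetical. The reference counter check that yields -EBUSY is deliberately omitted, because this document does not say where the counter is stored; a real implementation would perform it before unlinking.

```c
#include <errno.h>
#include <stddef.h>

/* Structure layout copied from the text above. */
struct nls_code_page
{
    const char *name;
    int (*request)(int function, ...);
    struct nls_code_page *next;
};

/* Sketch of the unlinking logic (reference counting omitted).
   `head` points to the list head owned by the manager. */
static int sketch_unregister_codepage(struct nls_code_page **head,
                                      struct nls_code_page *cp)
{
    /* `link` always points to the pointer that leads to the current
       element, so the head and inner elements are handled alike. */
    for (struct nls_code_page **link = head; *link != NULL;
         link = &(*link)->next)
    {
        if (*link == cp)
        {
            *link = cp->next;   /* bypass the element */
            cp->next = NULL;    /* leave it ready for a new registration */
            return 0;
        }
    }
    /* Not found: also covers cp == NULL, since no element is null. */
    return -EINVAL;
}
```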