Date:      Thu, 24 Aug 2000 17:39:39 +0700 (ALMST)
From:      Boris Popov <bp@butya.kz>
To:        freebsd-arch@freebsd.org, freebsd-i18n@freebsd.org, Konstantin Chuguev <Konstantin.Chuguev@dante.org.uk>
Subject:   Proposal to include iconv library in the base system.
Message-ID:  <Pine.BSF.4.10.10008241719320.80086-100000@lion.butya.kz>

[This message is cc'ed to -i18n, which has had zero activity in the last month.]

Proposal to include iconv library and iconv(1) program in the base system.

This library of functions and its companion iconv program convert
text between various single-byte and multibyte charsets.  The iconv*
functions are essential in mixed networks and on local machines with
multiple charsets.

FreeBSD already contains a few character conversion schemes: the
msdosfs, nwfs, cd9660fs and syscons mapping tables.  However, the usage
of these tables is not standardized and they support only a small
number of character sets.  Many external packages like KDE and GNOME also
rely on the iconv functions.

Konstantin Chuguev wrote the original code under the BSD license and I
modified it slightly.

The Open Group has a description of the iconv functions online:
http://www.opengroup.org/onlinepubs/7908799/xsh/iconv.html
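
For readers who have not used the interface, here is a minimal sketch
of the three calls described at that URL: iconv_open(), iconv() and
iconv_close().  The helper name convert_string() and its buffer
handling are my own assumptions for illustration; they are not part of
the proposed library.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/*
 * Convert a NUL-terminated string from charset `from' to charset `to'.
 * Returns the number of bytes stored in dst (not counting the NUL),
 * or (size_t)-1 on any failure.  Hypothetical helper, for illustration.
 */
size_t
convert_string(const char *to, const char *from, const char *src,
    char *dst, size_t dstsize)
{
	iconv_t cd;
	char *in = (char *)src;		/* iconv() takes a non-const char ** */
	char *out = dst;
	size_t inleft = strlen(src);
	size_t outleft;
	size_t rc;

	if (dstsize == 0)
		return ((size_t)-1);
	outleft = dstsize - 1;		/* keep one byte for the NUL */

	cd = iconv_open(to, from);	/* argument order: to, then from */
	if (cd == (iconv_t)-1)
		return ((size_t)-1);

	rc = iconv(cd, &in, &inleft, &out, &outleft);
	if (rc != (size_t)-1)		/* flush any pending shift sequence */
		rc = iconv(cd, NULL, NULL, &out, &outleft);
	iconv_close(cd);

	if (rc == (size_t)-1)
		return ((size_t)-1);
	*out = '\0';
	return ((size_t)(out - dst));
}
```

A caller would pass charset names understood by the installed
implementation, for example "UTF-8" and "ISO-8859-1", and must size dst
for the worst case of the target encoding.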

A brief overview of character sets is available at:
http://www.austin.ibm.com/doc_link/en_US/a_doc_lib/aixprggd/genprogc/codeset_over.htm


Short Introduction on Library Design and Implementation:

The library consists of a core part, the Character Encoding Scheme (CES)
modules and the Character Conversion Scheme (CCS) modules.  The core part
contains the exposed user functions and the internal framework for
modules.  To provide the maximum number of supported character set
combinations, the library uses Unicode as the intermediate charset.  CES
and CCS modules contain the conversion logic and conversion tables that
map characters between Unicode and the target charset.  The entire
character conversion process looks like this:

	charset1 -> unicode -> charset2

In addition, it is possible to perform conversion only to/from Unicode.
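
The two-step scheme above can be sketched as follows.  This is only an
illustration of the pivot idea, not the library's code: the structure
and function names are hypothetical, and the tables are three-entry
excerpts of CP437 and ISO-8859-1 rather than real CCS modules.

```c
#include <stddef.h>
#include <stdint.h>

#define UCS_INVALID	0xFFFFu		/* marker for "no mapping" */
#define NELEM(a)	(sizeof(a) / sizeof((a)[0]))

/* One row of a (hypothetical) charset <-> Unicode table. */
struct ccs_entry {
	uint8_t		byte;
	uint16_t	ucs;
};

/* Tiny excerpts only: a-diaeresis, e-acute, u-diaeresis. */
static const struct ccs_entry cp437_tab[] = {
	{ 0x84, 0x00E4 }, { 0x82, 0x00E9 }, { 0x81, 0x00FC },
};
static const struct ccs_entry latin1_tab[] = {
	{ 0xE4, 0x00E4 }, { 0xE9, 0x00E9 }, { 0xFC, 0x00FC },
};

/* Step 1: charset1 -> Unicode. */
static uint16_t
to_ucs(const struct ccs_entry *tab, size_t n, uint8_t byte)
{
	size_t i;

	if (byte < 0x80)		/* ASCII maps to itself */
		return (byte);
	for (i = 0; i < n; i++)
		if (tab[i].byte == byte)
			return (tab[i].ucs);
	return (UCS_INVALID);
}

/* Step 2: Unicode -> charset2; unmapped characters become '?'. */
static uint8_t
from_ucs(const struct ccs_entry *tab, size_t n, uint16_t ucs)
{
	size_t i;

	if (ucs < 0x80)
		return ((uint8_t)ucs);
	for (i = 0; i < n; i++)
		if (tab[i].ucs == ucs)
			return (tab[i].byte);
	return ('?');
}

/* charset1 -> Unicode -> charset2, one byte at a time. */
void
cp437_to_latin1(const uint8_t *in, uint8_t *out, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		out[i] = from_ucs(latin1_tab, NELEM(latin1_tab),
		    to_ucs(cp437_tab, NELEM(cp437_tab), in[i]));
}
```

The point of the pivot is that N charsets need N tables rather than a
table for every pair; the cost is two lookups per character.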

Modules are implemented as shared libraries and loaded via the dlopen()
function.  Modules reside in the /usr/lib/iconv/ directory and can be
dynamically added to the system.
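
Roughly, loading a module could look like the sketch below.  Only the
directory and the use of dlopen() come from the description above; the
".so" file suffix, the function names and the rule that a module name
may not contain '/' are assumptions made for the example.

```c
#include <dlfcn.h>
#include <stdio.h>
#include <string.h>

#define ICONV_MODULE_DIR	"/usr/lib/iconv"

/*
 * Build the path of a module file from a charset name.
 * The "<dir>/<name>.so" layout is an assumption, not the real scheme.
 * Returns 0 on success, -1 if the name is unsafe or does not fit.
 */
int
module_path(char *buf, size_t size, const char *name)
{
	int n;

	if (name[0] == '\0' || strchr(name, '/') != NULL)
		return (-1);		/* never leave the module directory */
	n = snprintf(buf, size, "%s/%s.so", ICONV_MODULE_DIR, name);
	if (n < 0 || (size_t)n >= size)
		return (-1);
	return (0);
}

/* Load a module; returns the dlopen() handle or NULL on failure. */
void *
load_module(const char *name)
{
	char path[1024];

	if (module_path(path, sizeof(path), name) != 0)
		return (NULL);
	return (dlopen(path, RTLD_NOW | RTLD_LOCAL));
}
```

After a successful load the caller would use dlsym() to find the
module's entry point and dlclose() when the conversion is closed; the
entry point name is not shown here because it is not specified above.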

To make the iconv subsystem more flexible, it has a "converter" layer
which allows additional converters to be plugged in.  Given two
arbitrarily chosen charsets charset1 and charset2, a converter lets a
program open a conversion, perform it, and then close it, releasing
its resources.  For now the library has only the so-called
Unicode Converter (UC).

For example, it is possible to write an XLAT converter which will
support direct, table-based conversion between known character sets.  Of
course, a new converter can use its own modules.
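
To make the open/convert/close idea concrete, here is a rough sketch of
what a converter interface and a toy XLAT converter might look like.
The structure layout and all names are hypothetical and do not reflect
the actual library; the table covers three CP437 letters and maps every
other non-ASCII byte to '?'.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical converter interface: open, convert, close. */
struct converter {
	const char *name;
	void	*(*open)(const char *to, const char *from);
	size_t	 (*convert)(void *state, const unsigned char *in,
		    size_t inlen, unsigned char *out, size_t outsize);
	void	 (*close)(void *state);
};

/* Build a direct 256-entry table for the one pair this toy knows. */
static void *
xlat_open(const char *to, const char *from)
{
	unsigned char *tab;
	int i;

	if (strcmp(from, "CP437") != 0 || strcmp(to, "ISO-8859-1") != 0)
		return (NULL);		/* pair not supported */
	tab = malloc(256);
	if (tab == NULL)
		return (NULL);
	for (i = 0; i < 256; i++)
		tab[i] = (i < 0x80) ? (unsigned char)i : '?';
	tab[0x81] = 0xFC;		/* u-diaeresis */
	tab[0x82] = 0xE9;		/* e-acute */
	tab[0x84] = 0xE4;		/* a-diaeresis */
	return (tab);
}

/* One table lookup per byte; returns the number of bytes converted. */
static size_t
xlat_convert(void *state, const unsigned char *in, size_t inlen,
    unsigned char *out, size_t outsize)
{
	const unsigned char *tab = state;
	size_t i, n;

	n = (inlen < outsize) ? inlen : outsize;
	for (i = 0; i < n; i++)
		out[i] = tab[in[i]];
	return (n);
}

/* Release the table allocated by xlat_open(). */
static void
xlat_close(void *state)
{
	free(state);
}

const struct converter xlat_converter = {
	"XLAT", xlat_open, xlat_convert, xlat_close
};
```

A direct table like this needs one lookup per byte, where the Unicode
route needs two, but it requires a separate table for each charset
pair.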

Since support for multiple character sets is also required in the
kernel, there is a kernel part which provides nearly the same set of
functions in kernel space.  Conversion tables are uploaded to kernel
memory via the sysctl interface from the corresponding userland modules
(no code, only data).

The first open question is which character sets should be
included in the base system and which should be supplied as packages.
Obviously, conversion tables occupy 99% of the space:

	Part Name                             Size of source code
	---------                             -------------------
	Library                                               83K

	Base character sets                                  218K
	(ISO-8859*, cp8??, windows-125?)

	CJK                                                 5548K
	(big5, cns*, gb*, jis*, cp9??)

	RFC1345 character sets                              1064K

	Unicode character sets                               711K
	---------------------------------------------------------


Secondly, where should the functions be placed?  Initially, the iconv
library was a separate file (libiconv*).  However, it seems that
Solaris has the functions in libc and Linux in glibc.  I do not
know how HP-UX does this.

And the third question is where I should place the source code for 
character conversion schemes in the source tree.

Of course, to respect maintainers of embedded systems and those who have
to deal with only one charset, a 'NO_ICONV' option will be provided.

I would appreciate any feedback on this topic.

P.S. The sources of libiconv in its current state are available at
http://www.butya.kz/~bp/inode/

-------------------------------------------- 

Michael C. Wu (keichii@iteration.net) reviewed my proposal before it was
posted and made some comments:

1) Does this allow for small patchsets to the character tables?
   i.e.  Unicode does not completely map to BIG-5.
         Some implementations map the differences directly to
         blank space, while others map to equivalent characters.
         Depending on the user's choice, one should be able
         to specify a small change to the charset table without
         disruption.

I think it is possible, and the author probably already knows the way :)
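
One possible way to do this, shown only as a sketch, is a small patch
table that is consulted before the base table, so that local overrides
win without touching the shipped data.  The names are hypothetical and
the code values a caller supplies are arbitrary; this is not how the
library is known to handle patches.

```c
#include <stddef.h>
#include <stdint.h>

/* One Unicode -> charset mapping (hypothetical layout). */
struct map_entry {
	uint32_t	ucs;
	uint16_t	code;
};

/* Linear search; returns 0 and stores the code if ucs is present. */
static int
map_find(const struct map_entry *tab, size_t n, uint32_t ucs,
    uint16_t *code)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (tab[i].ucs == ucs) {
			*code = tab[i].code;
			return (0);
		}
	}
	return (-1);
}

/*
 * Look up ucs in the patch table first and fall back to the base
 * table, so a patch can both override and extend the base mapping.
 * Returns 0 on success, -1 if neither table maps the character.
 */
int
lookup_patched(const struct map_entry *base, size_t nbase,
    const struct map_entry *patch, size_t npatch,
    uint32_t ucs, uint16_t *code)
{
	if (map_find(patch, npatch, ucs, code) == 0)
		return (0);
	return (map_find(base, nbase, ucs, code));
}
```

A real table would be sorted and searched with bsearch(3) or indexed
directly; the linear search only keeps the example short.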

2) We should include all EUC and ISO charsets to conform to
   standards, even if some of them are never used.

3) I suggest having the character tables in /usr/libdata.
   /usr/share should also have a directory that contains the mappings.
   For the kernel, perhaps we should have a src/sys/i18n
   and put iconv into src/sys/i18n/iconv.
   As to the libc code, to avoid trouble with compiling and patching
   ports, we should follow what Linux does in libc.

4) I do not think the charset tables will be bigger than 15 MB total.
	

--
Boris Popov
http://www.butya.kz/~bp/






