From owner-freebsd-hackers  Thu Jun 11 04:42:02 1998
To: Konstantin Chuguev <joy@urc.ac.ru>
cc: Gary Kline <kline@tao.thought.org>, Terry Lambert <tlambert@primenet.com>,
        hackers@FreeBSD.ORG
In-reply-to: joy's message of Thu, 11 Jun 1998 15:00:16 +0600.
      <357F9CA0.F8F1DD61@urc.ac.ru> 
X-Template-Reply-To: itojun@itojun.org
X-Template-Return-Receipt-To: itojun@itojun.org
X-PGP-Fingerprint: F8 24 B4 2C 8C 98 57 FD  90 5F B4 60 79 54 16 E2
Subject: Re: internationalization 
From: Jun-ichiro itojun Itoh <itojun@iijlab.net>
Date: Thu, 11 Jun 1998 20:33:33 +0900
Message-ID: <20418.897564813@coconut.itojun.org>
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG


>I see. Suppose it was made for saving space in the code table.
>And now, without external information about the language of the text,
>no one can properly render hieroglyphs.
>And I see ISO 2022 solves this problem for a plain text.
>But, although text/plain is very suitable for Email messages, for
>example,
>it is very difficult to index/search such documents without additional
>information (at least about language used), as different languages
>have different rules for sorting their letters/glyphs. Searching
>in multilingual documents is even more painful.
>How can it be realized with ISO 2022?
>I still think a flat character set table has many advantages in this
>case.
>Plus, as I said before, large database of each character's
>characteristics in Unicode.

	Handling (searching/indexing) multilingual data and storing
	multilingual data can be done by separate methods (and I prefer
	them to be orthogonal).

	IMHO, for storing information we must retain as much information as
	possible, so iso-2022 wins here (because it is fully multilingual,
	even from the standpoint of Asian language users).  For searching,
	there are several ways:
	1. Have some dictionary, or regular expressions, to unify the items
	   to be searched.  For example, the following regular expression
	   should match all occurrences that mean "data".
		(data|datum)
	   We can do this for multiple languages.
	2. Have a canonical form, just for handling/searching.  This can be
	   Unicode maybe, or it can be wchar_t (rune_t for xpg4).
	   Convert the source into the canonical form, perform search/index
	   over the canonical form, get the result, and dump the text
	   in canonical form.
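
	A minimal sketch of the two approaches, written in Python rather
	than C only because its codecs are easy to try out.  The function
	names and the sample strings are made up for illustration, and a
	Python str stands in for the canonical form; this is not the xpg4
	API.

```python
import re

# Approach 1: a pattern that unifies the spellings treated as equivalent.
DATA_PATTERN = re.compile(r"\b(data|datum)\b")

def find_variants(text):
    """Return every spelling variant of "data" found in text."""
    return DATA_PATTERN.findall(text)

# Approach 2: decode the stored bytes into a canonical form (a Python str
# here), search there, and leave the stored bytes untouched.
def search_stored(stored, encoding, needle):
    """Return True if needle occurs in the decoded form of stored."""
    canonical = stored.decode(encoding)
    return needle in canonical

stored = "これはデータです".encode("iso2022_jp")
print(find_variants("one datum, many data"))
print(search_stored(stored, "iso2022_jp", "データ"))
```

	The stored bytes stay in their original encoding; only the
	temporary copy used for matching is in the canonical form.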

	If you store the original information using a format that unifies
	part of the information in the source (e.g. Unicode), you'll lose
	some very important parts of the file, and the loss cannot be
	recovered.
	For example, if you convert all the files you have into uppercase
	for searching, you'll never recover the uppercase/lowercase
	information.  Unicode's unification is quite similar to this
	for Asian language speakers (especially people targeting
	multilingual use).
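
	The uppercase analogy can be made concrete with a few lines of
	Python.  This is only an illustration of a one-way mapping; the
	function name is hypothetical, and it says nothing about how
	Unicode's unification is actually defined.

```python
def fold_for_search(text):
    """Unify case so that a search ignores it (a lossy, one-way mapping)."""
    return text.upper()

originals = ["Unicode", "UNICODE", "unicode"]

# Three distinct originals collapse into a single folded form, so the
# folded form alone cannot tell us which original we started from.
folded = {fold_for_search(s) for s in originals}
print(len(originals), "originals ->", len(folded), "folded form")
```

	Folding is fine for a temporary search key, but if the folded text
	is the only copy kept, the distinction is gone for good.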

	The xpg4 (runelocale) library provides a beautiful way of
	establishing (2) above.
	You can have a source file in ANY encoding you prefer.  If you
	set the environment variable LANG (setenv LANG ja_JP.EUC), the rune
	library will convert everything into wchar_t on read, via functions
	like fgetrune().  Your program takes care of wchar_t only, and
	you can output the result in the original encoding via fputrune().
	The beauty here is that the mapping between the source file and
	wchar_t can be switched by the environment variable LANG.  It is not
	fixed, so we can be open about the internal encoding of wchar_t.
	The currently implemented xpg4 library uses 16-bit UCS2 for
	LANG=UTF2, and a 16-bit packed EUC form for LANG=ja_JP.EUC.  My
	library uses a 32-bit packed form for importing iso-2022 encoded
	strings into a 32-bit wchar_t.
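
	The same idea can be sketched in Python: decode on input according
	to LANG, work on one internal form, and encode back on output.  The
	LOCALE_CODECS table and both function names are assumptions made for
	this sketch (including the guess that UTF2 corresponds to Python's
	utf_8 codec); the real library works on wchar_t through fgetrune()
	and fputrune().

```python
import os

# Hypothetical mapping from the locale names above to Python codec names.
LOCALE_CODECS = {
    "ja_JP.EUC": "euc_jp",
    "ja_JP.iso-2022-jp": "iso2022_jp",
    "UTF2": "utf_8",
}

def codec_for_locale(lang=None):
    """Pick a codec from the given locale name, or from $LANG if omitted."""
    if lang is None:
        lang = os.environ.get("LANG", "")
    return LOCALE_CODECS.get(lang, "ascii")

def process(raw, lang, transform):
    """Decode raw bytes, apply transform to the text, re-encode the result."""
    codec = codec_for_locale(lang)
    text = raw.decode(codec)       # external encoding -> internal form
    result = transform(text)       # program logic sees the internal form only
    return result.encode(codec)    # internal form -> original encoding

raw = "日本語 text".encode("euc_jp")
out = process(raw, "ja_JP.EUC", str.upper)
print(out.decode("euc_jp"))
```

	The transform never needs to know which external encoding was in
	use, which is the point of keeping the mapping switchable.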

>I don't want to say we should stop using ISO 2022. I just want to say
>we shouldn't stop (should start) using Unicode. I.e. to use both
>of them, as both have their advantages and disadvantages.

	Yes, I agree that Unicode can be useful in some places.  But I
	do not like Unicode being the encoding for data sources (and Unicode
	tends to be pushed in that direction).  That way an important
	portion of the information will be lost.

>>         Of course my library supports both of them.  If you say
>>         setrunelocale("UTF2"), the internal and external representation
>>         will become Unicode.  If you say setrunelocale("ja_JP.iso-2022-jp")
>>         it will become Japanese iso-2022-jp encoding.
>>         I'll try to release my library with a sample application sooner.
>>         I think I can give you the tarball at New Orleans :-)
>Great.
>What about conversion?

	For conversion, there seem to be standard functions defined, such as
	iconv(3) and iconv_open(3).  I'm thinking of implementing these, but
	that requires me to have giant tables, such as:
		iso-2022<->unicode with japanese glyphs
		iso-2022<->unicode with korean glyphs
		iso-2022<->unicode with chinese glyphs
		and more...
	somewhere in the filesystem.
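
	As a rough picture of what such a conversion interface does, here is
	a sketch using Python's bundled codecs, which already carry their own
	mapping tables.  The convert() helper is hypothetical and only
	imitates the spirit of iconv(3); it is not its C interface.

```python
def convert(data, from_codec, to_codec):
    """Re-encode data from one codec to another, pivoting through str."""
    return data.decode(from_codec).encode(to_codec)

jis = "日本語".encode("iso2022_jp")

# iso-2022-jp -> UTF-8 and back again, using the codec tables in both
# directions.
utf8 = convert(jis, "iso2022_jp", "utf_8")
back = convert(utf8, "utf_8", "iso2022_jp")
print(back == jis)
```

	The round trip works here because all the text is Japanese; it does
	not show what happens to text that mixes several languages.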

>Having an internationalized OS still requires the ability of the user
>to communicate with other, non-internationalized parties with 8-bit
>or other character sets.

	I may not be getting what you mean here...

	For tagging the encoding method, we have the charset parameter of
	the Content-type: MIME header.  If the charset parameter is
	incompatible, the mailer can notify the user of the incompatibility.
	Also, there's the multipart/alternative MIME multipart type, so that
	the same content can be transmitted in multiple encodings.
	We must also have a way to restrict some text to conform to
	some defined charset (say, charset=iso-2022-jp).
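
	A small sketch of both mechanisms with Python's standard email
	package: the same text is sent as iso-2022-jp and as utf-8 inside
	multipart/alternative, and a receiving side checks each charset
	label.  The charset_supported() helper and the sample text are
	assumptions for illustration; a real mailer would decide
	compatibility in its own way.

```python
import codecs
from email.message import EmailMessage

def charset_supported(name):
    """Return True if this Python installation has a codec for name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False

msg = EmailMessage()
msg["Subject"] = "internationalization"
# The first call sets a text/plain body tagged with charset=iso-2022-jp.
msg.set_content("日本語のテキスト\n", charset="iso-2022-jp")
# The second call turns the message into multipart/alternative and adds
# the same content in another encoding.
msg.add_alternative("日本語のテキスト\n", charset="utf-8")

for part in msg.iter_parts():
    charset = part.get_content_charset()
    status = "ok" if charset_supported(charset) else "cannot display"
    print(charset, status)
```

	Passing charset= when composing is also one simple way to hold a
	piece of text to a defined charset: encoding fails if the text does
	not fit.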

	Or, do you mean how to literally convert Japanese/Chinese into ASCII?
	Yes, there are several ways, such as romaji for Japanese
	(I can write Japanese words in ASCII: "Fujiyama" "Geisha" "Sushi"
	"Harakiri"), or pinyin for Chinese (correct?).

itojun
