From owner-freebsd-hackers  Thu Jun 11 16:08:18 1998
From: Terry Lambert <tlambert@primenet.com>
Message-Id: <199806112307.QAA00116@usr09.primenet.com>
Subject: Re: internationalization
To: itojun@iijlab.net (Jun-ichiro itojun Itoh)
Date: Thu, 11 Jun 1998 23:07:43 +0000 (GMT)
Cc: joy@urc.ac.ru, kline@tao.thought.org, tlambert@primenet.com,
        hackers@FreeBSD.ORG
In-Reply-To: <20418.897564813@coconut.itojun.org> from "Jun-ichiro itojun Itoh" at Jun 11, 98 08:33:33 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> 	Handling (searching/indexing) multilingual data and storing
> 	multilingual data can be done in separate method (and I prefer
> 	them to be orthogonal).

I prefer collation sequence to be handled in a separate method, as well.


> 	IMHO, for storing information we must retain as much information as
> 	possible, so iso-2022 wins here (because it is fully multilingual,
> 	even from standpoint of asian language users).

ISO 2022 is inferior to SGML.


> 	If you store the original information using a format that unifies
> 	part of information in the source (e.g. Unicode) you'll lose some of
> 	the very important part in the file, and the lossage will not be
> 	recovered.

Not if you store the file in marked-up format, unless you are arguing
that you can't store SGML tags in Unicode, but you *can* store them in
ISO8859-1?

> 	For example, if you convert all the file you have into uppercase
> 	for searching, you'll never recover the uppercase/lowercase
> 	information.

This is why you compile your regular expressions: to save the expense
of the duplication and conversion, and to avoid damaging the original data.
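
A minimal sketch of that point, in Python for brevity (the language, the
function name and the sample string are all assumptions, not anything
from this thread): the pattern is compiled once with a case-insensitive
flag, and the stored text is searched as-is rather than uppercased first.

```python
import re

# Compile once; the case folding lives in the matcher, not in the data.
WORD = re.compile(r"unicode", re.IGNORECASE)

def find_ignoring_case(compiled, text):
    """Return every match exactly as it is spelled in the original text."""
    return [m.group(0) for m in compiled.finditer(text)]

original = "Unicode, unicode, UNICODE"
hits = find_ignoring_case(WORD, original)   # original is left untouched
lossy = original.upper()                    # the destructive alternative
```

The matches come back with their original case intact, while `lossy`
can no longer tell the three spellings apart.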


>       Unicode's unification is quite similar to this,
> 	for asian language speakers (especially multilingual-targeted
> 	people).

This is why there are round-trip character sets, and why there is still
locale information required.


> 	xpg4 (runelocale) library provides a beautiful way of establishing
> 	(2) in the above.
> 	You can have a source file with ANY encoding you prefer.  If you
> 	set environment variable LANG (setenv LANG ja_JP.EUC), rune library
> 	will convert everything into wchar_t on read, via functions like
> 	fgetrune().

This is *NOT* beautiful, unless you are in the business of selling very
fast microprocessors to people who already own fast microprocessors.

Trading markup for storage encoding that doesn't match processing
encoding is a bad trade.  It increases processing overhead drastically
for no real gain.


> >I don't want to say we should stop using ISO 2022. I just want to say
> >we shouldn't stop (should start) using Unicode. I.e. to use both
> >of them, as both have their advantages and disadvantages.
> 
> 	Yes, I agree that Unicode can be useful in some places.  But I
> 	do not like Unicode be the encoding for data sources (and Unicode
> 	tend to be stressed toward that).  That way important portion of
> 	the information will be lost.

Not if you encode it in-band, like the standard says you are supposed
to do, using a markup language (preferably a widely accepted standard,
such as SGML or the SGML DTD for RTF).
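
As a rough illustration of in-band attribution, the sketch below
recovers (language, text) runs from Unicode text carrying a `lang`
attribute. Several things here are assumptions: Python's
`html.parser` stands in for a real SGML parser, the `span` element
and `lang` attribute are hypothetical markup rather than any
particular DTD, and U+76F4 is used only as an example of a unified
code point.

```python
from html.parser import HTMLParser

class LangRuns(HTMLParser):
    """Collect (language, text) runs from text marked up with lang attributes."""

    def __init__(self):
        super().__init__()
        self.stack = [None]   # innermost language in effect; None = unmarked
        self.runs = []

    def handle_starttag(self, tag, attrs):
        # A tag without a lang attribute inherits the enclosing language.
        self.stack.append(dict(attrs).get("lang", self.stack[-1]))

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        self.runs.append((self.stack[-1], data))

def language_runs(marked_up):
    """Parse marked-up text and return its (language, text) runs in order."""
    parser = LangRuns()
    parser.feed(marked_up)
    parser.close()
    return parser.runs

doc = '<span lang="ja">\u76f4</span> / <span lang="zh">\u76f4</span>'
runs = language_runs(doc)
```

The same code point appears twice in `doc`, but the runs keep the
Japanese and Chinese attributions separate, which is the information
the unified encoding alone does not carry.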


> 	For conversion, there seems to be a standard function defined such as
> 	iconv(3) or iconv_open(3).  I'm thinking of implementing this, but it
> 	requires me to have a giant table, such as:
> 		iso-2022<->unicode with japanese glyphs
> 		iso-2022<->unicode with korean glyphs
> 		iso-2022<->unicode with chinese glyphs
> 		and more...
> 	somewhere in the filesystem.

It is not required in the kernel, except in support of legacy systems
exporting FS's via NFS.  Even then, it's still not required to be in
the kernel, if you are willing to accept a latency penalty for accessing
legacy systems that you are too stubborn (or otherwise unable) to
upgrade like you should.
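
For what a user-space conversion looks like, here is a small sketch.
It uses Python's built-in codecs in place of iconv(3), which is a
substitution and not the interface discussed above; the helper name
`convert` and the sample string are hypothetical.

```python
def convert(data, src, dst):
    """Re-encode bytes from src to dst by way of Unicode code points."""
    return data.decode(src).encode(dst)

# Three kanji, written as escapes so this message stays 7-bit clean.
sample = "\u65e5\u672c\u8a9e"

iso = sample.encode("iso2022_jp")                 # ISO 2022 form
wide = convert(iso, "iso2022_jp", "utf-16-le")    # fixed 16-bit form
back = convert(wide, "utf-16-le", "iso2022_jp")   # and back again
```

The tables itojun mentions are what sit behind the codec names here;
the point is only that nothing in this path needs to run in the kernel.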


> >Having an internationalized OS still require the ability of the user
> >to communicate with other, non-internationalized parties with 8-bit
> >or other character sets.
> 
> 	I may not be getting what you mean here...

NFS systems running an ISO 8859-1 character set are the most common
deployed case.  The data stream needs to be attributed in the kernel.

It is very tempting to attribute files as "text" and convert only text
files.

For UCS-2 (16-bit wchar_t process encoded Unicode or ISO 10646 0/16)
encoded files, it is trivial to expand one page to two pages when
memory mapping the file.

It is a hell of a lot less trivial to memory map an EUC encoded JIS-208
(or UTF-7/8 encoded Unicode) file using only code points 0x0000 to
0x00ff.  This is because there is not a fixed expansion/contraction
ratio, nor is there a mechanism for faulting on non-page boundaries in
most modern processors.

I'm not willing to give up the ability to memory map files that
contain text.
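
The difference between the two cases can be sketched in a few lines.
This is a user-space illustration only, not a pager: the 4096 byte
page size, the little-endian 16-bit form, and all of the function
names are assumptions made for the example.

```python
PAGE = 4096  # assumed page size in bytes

def widen_latin1(page):
    """Expand ISO 8859-1 bytes to 16-bit units; always exactly 2x."""
    return page.decode("latin-1").encode("utf-16-le")

def wide_range_for_page(page_index, page_size=PAGE):
    """Byte range in the 16-bit view backing one 8-bit page: pure arithmetic."""
    start = page_index * page_size * 2
    return start, start + page_size * 2

def char_offsets(data, encoding):
    """Byte offset where each character starts; requires scanning the data."""
    offsets, pos = [], 0
    for ch in data.decode(encoding):
        offsets.append(pos)
        pos += len(ch.encode(encoding))
    return offsets

mixed = "a\u3042b"   # ASCII, one hiragana, ASCII
euc_offsets = char_offsets(mixed.encode("euc_jp"), "euc_jp")
utf8_offsets = char_offsets(mixed.encode("utf-8"), "utf-8")
```

With the fixed ratio, the range for any page falls out of a
multiplication. With EUC or UTF-8, the offset of the Nth character
depends on every character before it, so a page fault cannot be
mapped to a source offset without a scan.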


> 	Or, do you mean how to literally convert Japanese/Chinese into ASCII?
> 	Yes, there are several ways.  Such as ROMA-JI for Japanese
> 	(I can write Japanese words in ASCII: "Fujiyama" "Geisha" "Sushi"
> 	"Harakiri"), or pinyin for Chinese (correct?).

Or you can convert the Japanese words into Katakana, instead, so long
as you are willing to use the appropriate 8-bit character set, and
forsake Kanji to do it.  The sword cuts both directions.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
