From: Terry Lambert
Message-Id: <199806112307.QAA00116@usr09.primenet.com>
Subject: Re: internationalization
To: itojun@iijlab.net (Jun-ichiro itojun Itoh)
Date: Thu, 11 Jun 1998 23:07:43 +0000 (GMT)
Cc: joy@urc.ac.ru, kline@tao.thought.org, tlambert@primenet.com, hackers@FreeBSD.ORG
In-Reply-To: <20418.897564813@coconut.itojun.org> from "Jun-ichiro itojun Itoh" at Jun 11, 98 08:33:33 pm

> Handling (searching/indexing) multilingual data and storing
> multilingual data can be done in separate method (and I prefer
> them to be orthogonal).

I prefer collation sequence to be handled in a separate method, as well.

> IMHO, for storing information we must retain as much information as
> possible, so iso-2022 wins here (because it is fully multilingual,
> even from standpoint of asian language users).

ISO 2022 is inferior to SGML.

> If you store the original information using a format that unifies
> part of information in the source (e.g. Unicode) you'll lose some of
> the very important part in the file, and the lossage will not be
> recovered.

Not if you store the file in marked-up format, unless you are arguing
that you can't store SGML tags in Unicode, but you *can* store them
in ISO8859-1?

> For example, if you convert all the file you have into uppercase
> for searching, you'll never recover the uppercase/lowercase
> information.

This is why you compile your regular expressions: to save the expense
of duplication and conversion, and to avoid damaging the original data.

> Unicode's unification is quite similar to this,
> for asian language speakers (especially multilingual-targetted
> people).

This is why there are round-trip character sets, and why there is
still locale information required.

> xpg4 (runelocale) library provides a beautiful way of establishing
> (2) in the above.
> You can have a source file with ANY encoding you prefer. If you
> set environment variable LANG (setenv LANG=ja_JP.EUC), rune library
> will convert everything into wchar_t on read, via functions like
> fgetrune().

This is *NOT* beautiful, unless you are in the business of selling
very fast microprocessors to people who already own fast
microprocessors.

Trading markup for storage encoding that doesn't match processing
encoding is a bad trade. It increases processing overhead
drastically for no real gain.

> >I don't want to say we should stop using ISO 2022. I just want to say
> >we shouldn't stop (should start) using Unicode. I.e. to use both
> >of them, as both have their advantages and disadvantages.
>
> Yes, I agree that Unicode can be useful in some places. But I
> do not like Unicode be the encoding for data sources (and Unicode
> tend to be stressed toward that). That way important portion of
> the information will be lost.

Not if you encode it in-band, like the standard says you are supposed
to do, using a markup language (preferably a widely accepted standard,
such as SGML or the SGML DTD for RTF).

> For conversion, there seems to be a standard function defined such as
> iconv(3) or iconv_open(3). I'm thinking of implementing this, but it
> requires me to have a giant table, such as:
>	iso-2022<->unicode with japanese glyphs
>	iso-2022<->unicode with korean glyphs
>	iso-2022<->unicode with chinese glyphs
>	and more...
> somewhere in the filesystem.

It is not required in the kernel, except in support of legacy systems
exporting FS's via NFS. Even then, it's still not required to be in
the kernel, if you are willing to accept a latency penalty for
accessing legacy systems that you are too stubborn (or otherwise
unable) to upgrade like you should.

> >Having an internationalized OS still require the ability of the user
> >to communicate with other, non-internationalized parties with 8-bit
> >or other character sets.
>
> I maybe not getting what you mean here...

NFS systems running an ISO 8859-1 character set are the most common
deployed case. The data stream needs to be attributed in the kernel.

It is very tempting to attribute files as "text" and convert only
text files.

For UTF2 (16 bit wchar_t process encoded Unicode or ISO 10646 0/16)
encoded files, it is trivial to expand one page to two pages when
memory mapping the file.

It is a hell of a lot less trivial to memory map an EUC encoded JIS-208
(or UTF-7/8 encoded Unicode) file using only code points 0x0000 to
0x00ff. This is because there is not a fixed expansion/contraction
ratio, nor is there a mechanism for faulting on non-page boundaries in
most modern processors.

I'm not willing to give up the ability to memory map files that
contain text.

> Or, do you mean how to literally convert Japanese/Chinese into ASCII?
> Yes, there are several ways.
> Such as ROMA-JI for Japanese
> (I can write Japanese words in ASCII: "Fujiyama" "Geisha" "Sushi"
> "Harakiri"), or ping-ying for Chinese (correct?).

Or you can convert the Japanese words into Katakana, instead, so long
as you are willing to use the appropriate 8-bit character set, and
forsake Kanji to do it.

The sword cuts both directions.

					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
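
The memory-mapping point in the message (a fixed expansion ratio for
8-bit to 16-bit text, no fixed ratio for multibyte encodings) can be
sketched roughly as follows. This is only an illustration under stated
assumptions: the 4096-byte page size, the function names, and the use
of Python's "latin-1", "utf-16-le" and "utf-8" codecs as stand-ins for
ISO 8859-1, 16-bit process-encoded text and a variable-width storage
encoding are not from the original message.

```python
PAGE = 4096  # assumed page size in bytes; the message does not name one


def widen_latin1_page(page: bytes) -> bytes:
    """Expand ISO 8859-1 bytes into 16-bit little-endian code units.

    Every input byte becomes exactly two output bytes, so the byte at
    input offset i always lands at output offset 2 * i. One input page
    therefore maps onto exactly two output pages.
    """
    return page.decode("latin-1").encode("utf-16-le")


def utf8_offset_of_char(data: bytes, index: int) -> int:
    """Return the byte offset of character number `index` in UTF-8 data.

    There is no fixed ratio between character index and byte offset,
    so the only way to find the offset is to scan from the start and
    read the length of each character from its lead byte.
    Assumes `data` is valid UTF-8 with at least `index` characters.
    """
    offset = 0
    for _ in range(index):
        lead = data[offset]
        if lead < 0x80:
            offset += 1      # single-byte (ASCII) character
        elif lead < 0xE0:
            offset += 2      # two-byte sequence
        elif lead < 0xF0:
            offset += 3      # three-byte sequence
        else:
            offset += 4      # four-byte sequence
    return offset


if __name__ == "__main__":
    page = bytes(range(256)) * (PAGE // 256)
    wide = widen_latin1_page(page)
    print(len(page), len(wide))

    sample = "a\u00e9\u65e5".encode("utf-8")
    print([utf8_offset_of_char(sample, i) for i in range(3)])
```

In the fixed-ratio case a fault on any output page can be satisfied
from a known input page; in the variable-width case the input position
for a given output page depends on everything that came before it.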