Date:      Thu, 11 Jun 1998 22:36:57 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        itojun@iijlab.net (Jun-ichiro itojun Itoh)
Cc:        joy@urc.ac.ru, kline@tao.thought.org, tlambert@primenet.com, hackers@FreeBSD.ORG
Subject:   Re: internationalization
Message-ID:  <199806112236.PAA28653@usr09.primenet.com>
In-Reply-To: <11417.897551055@coconut.itojun.org> from "Jun-ichiro itojun Itoh" at Jun 11, 98 04:44:15 pm

> >>         Yes, iso-2022 families are quite important for supporting
> >>         asian languages.  Unicode is, for us Japanese, quite incomplete and
> >>         unexpandable.
> >Do you mean Unicode does not cover all the CJK characters?
> 
> 	Unicode maps different Chinese/Japanese/Korean letters into the same
> 	codepoint.  The actual appearance (glyph) will be determined by
> 	the selection of font.  (So there will be a font just for Chinese,
> 	a font just for Japanese, and a font just for Korean.)

This is an oversimplification.

There will be a font for each round-trip character set.  Character sets
whose existing standards already codified code points as belonging to
different languages were not unified; English and Japanese, for example.

This is only a problem in the case of trying to use two locales
simultaneously.  This never happens, unless you are a linguistic
scholar or translator.

For linguistic scholars and translators, the issue is resolved by using
a markup language.  The cost of using a markup language is paid by the
people needing more than one locale at the same time.

Instead of all of us having to pay for it, the tiny number of people
engaged in scholarship and translation have to pay for it.  This is
better because the people who benefit are made to pay for the benefit,
instead of everyone shouldering the burden for a few unique applications.


> 	Therefore, while it may be sufficient for supporting a single Asian
> 	language (for example, Japanization), it is not sufficient for
> 	multilingualization (C/J/K support at the same time).  With Unicode,
> 	you will never be able to write a plaintext with C/J/K letters mixed.
> 	For example, I frequently write such a plaintext: a list of dishes
> 	for a Chinese restaurant, with descriptions in Japanese attached.
> 	Such a plaintext cannot be generated with Unicode.

It can be generated with marked-up Unicode, however.  Unicode is a
character set, not a font.  For reasons previously detailed, Unicode
can *never* be a font, and was never intended as one.
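
As a purely illustrative sketch (the tag names are hypothetical, not
taken from any particular DTD), such a menu is just Unicode plus markup:

    /* a Unicode plaintext, with the language carried as markup rather
     * than burned into the character code; the tag names are made up
     */
    const char *menu_entry =
        "<item lang=\"zh\">...dish name, Chinese glyphs...</item>"
        "<desc lang=\"ja\">...description, Japanese glyphs...</desc>";

The renderer picks the Chinese or the Japanese font from the lang
attribute, not from the code points themselves.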

I defy you to show me a locale that supports both Japanese and Chinese
file names simultaneously.  You won't be able to do it because there
is no character set standard that includes both all of the Japanese
and all of the Chinese code points.


> >What is "unexpandable"?
> 
> 	Unicode people stressed Unicode because of the "fixed bitwidth"
> 	nature of Unicode.  Therefore, basically they will not be able to
> 	support more than 2^16 letters.
> 	Recently Unicode introduced "surrogate pair" which makes Unicode
> 	a variable bitwidth character set.  This breaks the key feature of
> 	Unicode, and it shows that Unicode is not expandable by nature.
> 	(Correct me if I'm wrong about "surrogate pair"...)

I believe you are.

The real issue is not Unicode, which is code page 0 of ISO 10646, but
ISO 10646 itself, which supports 2^32 letters; 2^16 letters in each of
2^16 code pages.

The only code page defined right now is code page 0/16, which is defined
to be Unicode.
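
For what it's worth, the surrogate mechanism is nothing more than a
little arithmetic over two reserved 16 bit ranges; a minimal sketch
(the constants are those in the Unicode 2.0 surrogate definition):

    /* split an ISO 10646 value above 0xFFFF into a UTF-16 surrogate pair */
    void
    to_surrogates(unsigned long c, unsigned short *hi, unsigned short *lo)
    {
            c -= 0x10000;                   /* 20 bits remain           */
            *hi = 0xD800 + (c >> 10);       /* high (leading) surrogate */
            *lo = 0xDC00 + (c & 0x3FF);     /* low (trailing) surrogate */
    }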


> 	iso-2022 is well designed to accommodate new character sets that
> 	appear later.  Even with the simplest subset it can accommodate a
> 	bunch of character sets.

ISO 2022 is a font family markup standard, where font families are
made identical to round-trip character sets.

ISO 2022 is an *inferior* markup language, compared to SGML.


> 	Handling a bare iso-2022 string is somewhat hard to implement because
> 	it is variable length (yes, I agree).  If we can provide a good
> 	library for iso-2022, then there's no reason for us to migrate to
> 	Unicode.

Except that 85% of the computer systems in the world and 90% of the
computers in the Western world are going to be running Unicode by the
year 2010 because of Microsoft Windows and Java.

And we would like to be able to interoperate with them without paying
a very high conversion overhead when we do it.


> >>         Yes, for Japanese, Chinese and Korean iso-2022 based model (euc-xx
> >>         falls into the category) is really important.  However, I 
> >Why not to support both ISO 2022 and Unicode? Yes, it is more difficult
> >to implement. But otherwise we can lose compatibility with other systems.
> 
> 	Of course my library supports both of them.  If you say
> 	setrunelocale("UTF2"), the internal and external representation
> 	will become Unicode.  If you say setrunelocale("ja_JP.iso-2022-jp"),
> 	it will become the Japanese iso-2022-jp encoding.

This is certainly a step in the right direction; however, I would still
desperately encourage the use of a 16-bit wchar_t for the internal data
representation in programs operating in a single locale.
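
A minimal sketch of that single-locale model, using the portable
setlocale()/mbtowc() interface rather than the rune functions (the
locale picked up is whatever the user has configured):

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
            const char *mb = "external text in the locale's encoding";
            wchar_t wc;
            int len;

            setlocale(LC_CTYPE, "");        /* the user's one locale */

            /* convert the first external (multibyte) character into the
             * fixed-width internal wchar_t representation */
            len = mbtowc(&wc, mb, MB_CUR_MAX);
            if (len > 0)
                    printf("consumed %d bytes -> wide char 0x%lx\n",
                        len, (unsigned long)wc);
            return (0);
    }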

The entire ISO 8859-X using world has 8-bit characters.  Going to UTF2
is asking them to add encoding attributes to their FS's where possible,
and where that's not possible, to double the storage requirements for
their data.

Going to 32 bits, especially given that ISO 10646, the largest character
set standard you can point at, only defines code page 0/16, is madness.
The Western world will simply refuse to bear the overhead of 4 times
the dataspace requirements to benefit the few people making Chinese
restaurant menus for use in Japan, and who refuse to use a markup language
to do it.
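
To put numbers on the storage argument, here is the per-character cost
of UTF2 (the encoding also known as FSS-UTF and UTF-8); this is the
length rule only, not an encoder:

    /* bytes needed to store one ISO 10646 value in UTF2/UTF-8 */
    static int
    utf8_len(unsigned long c)
    {
            if (c < 0x80)        return (1);   /* US ASCII stays 1 byte  */
            if (c < 0x800)       return (2);   /* ISO 8859-X high half   */
            if (c < 0x10000)     return (3);   /* CJK, rest of page 0    */
            if (c < 0x200000)    return (4);
            if (c < 0x4000000)   return (5);
            return (6);
    }

An 8859 user pays double for every non-ASCII character; a fixed 32 bit
representation on disk would pay four times for everything.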

There are Western advocates, specifically those using US ASCII and 7
bit NRCS (National Replacement Character Sets) who advocate UTF-7 and
UTF-8 encoding so that they don't have to change their existing data
files to have their code support Japanese or Chinese.  There's no real
unified computing infrastructure in Japan (it being broken into vendor
specific hardware markets), and that makes it a lot of expense
for very little potential market.  It's going to be hard enough to
convince the US idiots that trading more RAM for lower processing
overhead is a good idea.

The use of wchar_t as a font index is ill-considered.  The font is not
the same as the character set, nor should it be.  The index should be
based on the relative offset into the font, and use a base+offset to
deal with multiple fonts in a single rendering space.
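
A hedged sketch of that base+offset arrangement (the structure and
field names are illustrative only, not an existing interface):

    #include <stddef.h>

    struct font;                     /* glyphs for one round-trip char set */

    /* one font mapped into a shared rendering space */
    struct font_slot {
            struct font     *font;
            unsigned long    base;   /* first index assigned to this font  */
            unsigned long    count;  /* number of glyphs it contributes    */
    };

    /* resolve a rendering-space index to (font, glyph offset), so the
     * character set value itself never has to double as a font index */
    static struct font_slot *
    lookup_slot(struct font_slot *slots, int nslots, unsigned long idx,
        unsigned long *offset)
    {
            int i;

            for (i = 0; i < nslots; i++) {
                    if (idx >= slots[i].base &&
                        idx < slots[i].base + slots[i].count) {
                            *offset = idx - slots[i].base;
                            return (&slots[i]);
                    }
            }
            return (NULL);
    }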


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
