Date:      Fri, 12 Jun 1998 01:19:28 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        kline@tao.thought.org (Gary Kline)
Cc:        joy@urc.ac.ru, itojun@itojun.org, tlambert@primenet.com, hackers@FreeBSD.ORG
Subject:   Re: internationalization
Message-ID:  <199806120119.SAA06619@usr09.primenet.com>
In-Reply-To: <199806112234.PAA12768@tao.thought.org> from "Gary Kline" at Jun 11, 98 03:34:13 pm

> 		Let me pose the same question, a bit more broadly.
> 		Why cannot we support _both_ the ISO and Unicode
> 		paradigms?  Are these absolutely incompatible systems?
> 		Is there some kind of ``religious-war''?   Or is it
> 		simply too difficult?

ISO 10646 plane 0 (the Basic Multilingual Plane) *is* Unicode, by definition.

The religious aspects have to do with the old trade-offs the various
programmers are already used to, the new trade-offs the various
programmers would have to start putting up with, and the various
language bigotries people bring to the table.


Major premise: everyone is going to have to put up with a non-8-bit
wchar_t internally in their applications.  This is called the "raw"
or "process" representation.
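
A minimal sketch of that split, in Python rather than C: Python's `str`
stands in for a wide-character (`wchar_t`) buffer, and the KOI8-R sample
text is an assumption chosen purely for illustration.  Bytes are decoded
once on the way in, the program works on whole characters, and they are
encoded again only on the way out.

```python
# External ("file") representation: bytes in some legacy 8-bit charset.
# KOI8-R is an illustrative assumption, not something fixed by the text.
external = "Привет".encode("koi8_r")

# Process ("raw") representation: one element per character, not per byte.
internal = external.decode("koi8_r")

# Character-level operations work on the process representation.
shouted = internal.upper()

# Convert back to the external representation only at the output boundary.
written = shouted.encode("koi8_r")

print(len(external), len(internal), shouted)
```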

----------------------------------------------------------------------
----------------------------------------------------------------------
Are most of your files in ASCII?
----------------------------------------------------------------------

Then you want UTF7/UTF8/ISO2022 encoding, so you don't have to change
them.  Unless you plan to export your software.  Let the non-English
speaking world deal with the incompatibilities and storage bloat
problems.  You'll deal with it in your software when Japan and Europe
"get their act together" and standardize on IBM-PC derived hardware
so that your software won't have to be ported to run.

Besides, C code is in the "C" locale, and that's US-ASCII already.
GCC supports trigraphs, right?

----------------------------------------------------------------------
Are most of your files in ISO8859-X and/or KOI-8X?
----------------------------------------------------------------------

Then you don't want UTF7/UTF8, because if you get them, some
characters that currently take up one byte will take up between one
and three bytes (one if they are US ASCII, more if they are in the
0x80-0xff range).
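
To make the growth concrete, here is a small sketch that measures how
many UTF-8 bytes a single legacy byte turns into.  The helper name
`utf8_size` and the sample bytes are hypothetical, picked only to cover
the ASCII range and the 0x80-0xff range.

```python
def utf8_size(byte_value, charset):
    """Return how many bytes one legacy single-byte character needs in UTF-8."""
    char = bytes([byte_value]).decode(charset)
    return len(char.encode("utf-8"))

samples = [
    (0x41, "latin-1"),  # 'A', plain US-ASCII
    (0xE9, "latin-1"),  # e with acute accent in ISO 8859-1
    (0xC1, "koi8_r"),   # a Cyrillic letter in KOI8-R
    (0x80, "koi8_r"),   # a box-drawing character in KOI8-R
]

for byte_value, charset in samples:
    print(hex(byte_value), charset, utf8_size(byte_value, charset))
```

The ASCII byte stays at one byte, the accented Latin letter and the
Cyrillic letter each need two, and the KOI8-R box-drawing character
needs three.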

You also don't want ISO2022, because instead of simply choosing a
locale for all your data, you will have to deal with character set
shift processing.
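
The shift processing is visible in the bytes themselves.  The sketch
below uses Python's `iso2022_jp` codec on a made-up mixed string and
counts the escape sequences that switch the active character set; any
program reading such data has to track that state.

```python
ESC = b"\x1b"

# Mixed ASCII and Japanese text, encoded with Python's ISO-2022-JP codec.
encoded = "abc 日本語 xyz".encode("iso2022_jp")

# Every ESC byte starts a sequence that switches the active character set.
shift_count = encoded.count(ESC)

print(encoded)
print("character set switches:", shift_count)
```

Here one escape sequence switches into JIS X 0208 before the kanji and
another switches back to ASCII after them.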

You could put up with UTF2, because you could do kernel magic to
expand existing text files on existing filesystems by setting a
per FS attribute that tells how to get the data in and out of
Unicode representation.  You still need a "magic doohickey" that
tells the filesystem to do this for text files, but not for other
files.

----------------------------------------------------------------------
Are most of your files in ISO2022-jp (JIS-208/JIS-212)?
----------------------------------------------------------------------

Then you don't want UTF7/UTF8/UTF2 encoding, because you don't
want to have to convert your data.  You don't want Unicode because
it means you'll have to deal with the sorting problem all over
again because Unicode's collation sequence isn't the JIS-208/JIS-212
collation sequence.
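
A rough illustration of the ordering mismatch, assuming raw code order
as a crude stand-in for real collation: the two kanji below sort one way
by Unicode code point and the other way by their JIS X 0208 position
(taken here from the EUC-JP bytes, which keep the JIS row/cell order).

```python
kanji = ["一", "亜"]

# Unicode code point order: what a naive sort on the process
# representation gives.
by_unicode = sorted(kanji)

# JIS X 0208 order, approximated by sorting on the EUC-JP bytes.
by_jis = sorted(kanji, key=lambda ch: ch.encode("euc_jp"))

print(by_unicode, by_jis)
```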

You don't care about all the crap that goes with multibyte encoding,
because you've already dealt with all the bugs that causes in all
your existing software already.

You don't care about the storage bloat, because you already need
as many bytes as the bloat will cause to store the characters in
the character sets you use, so it doesn't matter to you that the
code produced in other countries will bloat up and start exhibiting
bugs it didn't have before they tried to localize into your
locale.
----------------------------------------------------------------------
----------------------------------------------------------------------

What this boils down to is language bigotry, and whose language
you prefer.  Generally, the preference is driven by either personal
or economic interests (like the competitive advantage to your own
locale from having your locale's preferred method chosen).

The short-sighted approach is to make the decision based on your own
personal bigotry.

The longer-sighted approach is to make the decision which has the
best workarounds for backward compatibility and in-place conversion,
and the least impact in the future based on the assumption that the
software market is going to normalize all over the world at some point
in the future, and you just may be around still and have to deal with
it.  Like the Y2K problem.


Of course, this totally ignores the fact that Microsoft owns the
world at the present time, and they've already made the correct
long term decision on the assumption that they will be around forever
and have to deal with it... another decision based on economic
interests, in fact.


If the aliens land, and we end up needing more than 2^16 characters
in our wchar_t space, well, we can deal with that problem when it
happens.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message


