From owner-freebsd-hackers  Wed Mar 12 10:13:45 1997
Return-Path: <owner-hackers>
Received: (from root@localhost)
          by freefall.freebsd.org (8.8.5/8.8.5) id KAA29985
          for hackers-outgoing; Wed, 12 Mar 1997 10:13:45 -0800 (PST)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.50])
          by freefall.freebsd.org (8.8.5/8.8.5) with SMTP id KAA29979
          for <hackers@freebsd.org>; Wed, 12 Mar 1997 10:13:42 -0800 (PST)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id LAA27652; Wed, 12 Mar 1997 11:00:47 -0700
From: Terry Lambert <terry@lambert.org>
Message-Id: <199703121800.LAA27652@phaeton.artisoft.com>
Subject: Re: Q: Locale - is it possible to change on the fly?
To: jfieber@indiana.edu (John Fieber)
Date: Wed, 12 Mar 1997 11:00:47 -0700 (MST)
Cc: terry@lambert.org, pam@polynet.lviv.ua, hackers@freebsd.org
In-Reply-To: <Pine.BSF.3.95q.970311220457.26807G-100000@fallout.campusview.indiana.edu> from "John Fieber" at Mar 11, 97 11:03:29 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

> > Like Unicode, it is a tool for localization, not multinationalization;
> > tools for multinationalization don't really exist, per se, since their
> > application is limited to language researchers and translators.  The
> 
> Huh?
> 
> The Unicode 2.0 standard explicitly states multilingual computing
> as the primary goal of the development effort. (First sentence in
> section 1.1: Design Goals.)
> 
> The problem with locales is that they address the operating
> environment for software, but blindly assume it to be appropriate
> for whatever data is encountered.  Some dimensions of the locale
> may remain "local", but other parts need to be driven by the
> data, not the LANG environment variable.  For well behaved MIME
> mail messages this can work pretty well, but it does not work in
> the general case. 
> 
> Unicode attempts to help out here by providing a locale
> independent data coding scheme.  With an en_US.ISO_8859-1 locale,
> a document in Russian (KOI8-R) cannot be properly processed.  If I
> want to index it, how do I know what codes constitute word
> boundaries?  What if I want to combine Russian and French in the
> same index, or, heaven forbid, in the same document?  Now, if I
> had an en_US.UTF locale (I actually do, but it is a little buggy)
> and the Russian and French document was in unicode, I could
> sensibly process it in a useful manner even though my preferred
> locale was different.

Unicode is a character encoding standard, not a font encoding standard;
because of this, Unicode cannot simultaneously represent Chinese and
Japanese characters; it can only represent characters, period.  The
"Japanese-ness" or "Chinese-ness" of characters is a font property,
not a character property.  A character is not one of its possible
glyphs.
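
As a rough illustration of the character/glyph distinction (a sketch
only, using Python's standard unicodedata module; the choice of U+76F4
as the sample ideograph is mine, picked because it is a commonly cited
case where Japanese and Chinese fonts draw the same code point
differently):

```python
import unicodedata

# U+76F4 is a unified Han ideograph.  The sample character is an
# assumption chosen for illustration; it is not from the discussion.
ch = "\u76f4"

# The character database gives it one name and one code point.
# Nothing here says "Japanese" or "Chinese".
name = unicodedata.name(ch)
utf8 = ch.encode("utf-8")

print(f"U+{ord(ch):04X} {name} {utf8.hex()}")
```

Whatever glyph appears on screen for that code point comes from the
font that was selected, not from anything stored in the text.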

Likewise, Unicode cannot encode the ligature relationships between
code points for ligatured languages, such as Arabic, Aramaic, Sanskrit,
Hebrew, Tamil, Devanagari, other Indic languages, or even, to get down
to brass tacks, cursive English.  It is a *character* encoding standard.

The problem with representing multilingual documents is dealt with
using compounding, in an implementation dependent fashion, to achieve
font encoding.  The compounding mechanism is beyond the scope of the
Unicode standard.
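
One way such compounding might look, purely as a sketch: the Run
class, the language tags, and the font names below are all
hypothetical, invented here for illustration, and are not part of
Unicode or of any particular implementation.

```python
from dataclasses import dataclass

@dataclass
class Run:
    lang: str   # hypothetical language tag carried outside the text
    text: str   # plain Unicode text of the run

# Hypothetical mapping from language tag to font; names are made up.
FONT_FOR_LANG = {"ja": "some-japanese-font", "zh": "some-chinese-font"}

def attribute_fonts(runs):
    """Pair each run's text with the font its language tag selects."""
    return [(FONT_FOR_LANG[r.lang], r.text) for r in runs]

# The same code point appears in both runs; only the out-of-band
# tag tells the renderer which glyph style to use.
doc = [Run("ja", "\u76f4"), Run("zh", "\u76f4")]
rendered = attribute_fonts(doc)
print(rendered)
```

The point of the sketch is only that the language information lives
in the compound structure, alongside the character data rather than
inside it.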

The closest you can get to multilingual support is to use a round trip
character set with font assignments for code points.  For example, the
ISO 8859-1 (Latin-1) character set can support several languages at
the same time for a given document; therefore, a Unicode document can
represent those same languages because there are round-trip code-points
for the characters in the 8859-1 standard.  Likewise, JIS 208 + JIS 212
can wholly support 21 separate languages.
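
The 8859-1 round trip can be checked mechanically.  A minimal sketch,
assuming Python's built-in "latin-1" codec as the mapping table (the
variable names are mine):

```python
# Every possible 8859-1 byte value.
latin1_bytes = bytes(range(256))

# 8859-1 -> Unicode: each byte maps to the code point of equal value.
text = latin1_bytes.decode("latin-1")
same_values = [ord(c) for c in text] == list(range(256))

# Unicode -> 8859-1: the original bytes come back unchanged.
round_trip_ok = text.encode("latin-1") == latin1_bytes

# A code point outside the set has no way back into 8859-1.
try:
    "\u76f4".encode("latin-1")
    outside_fits = True
except UnicodeEncodeError:
    outside_fits = False

print(same_values, round_trip_ok, outside_fits)
```

The round trip holds only for characters that exist in the legacy
set; anything else needs a different set with its own mapping table.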


But you can not encode Chinese and Japanese simultaneously because
there is no common character set, with a defined round trip mapping
table, for doing that.


> Multilingual applications limited to linguists?  I suspect there
> are plenty of people who know and use languages that don't share
> the same character encoding. :)  Unicode also provides a rich
> assortment of other things useful regardless of your language.

You misunderstand me... Just because Unicode is useless, by itself,
for multilingual processing (it's a tool for localization of software
to a specific round-trip locale, with no additional modifications of
the software), does not mean that it is useless entirely.


> How many times have you seen web pages with the telltale signs of
> "smart quotes"?  Box drawing characters that are portable across
> platforms?  Wheee!  Math symbols?  Lots of people could use a
> richer set than + - / * and ^.

You can't use Unicode for this... how can you attribute fonts on, for
instance, a Japanese www page on Chinese poetry?  Any character sets
which have mutually unified code points that have different glyphs
can not be simultaneously represented without font attribution.  The
Unicode standard is not a glyph encoding standard.

> > best you can hope for is picking a single round-trip character set
> > that supports both your languages.  You will never find one of these
> > for, for example, Chinese and Japanese.
> 
> I gather it is possible to round-trip CJK conversions through
> unicode by utilizing the private use area.  I don't speak from
> direct experience on this however.

Yes and no.  The private use area is too small for most scholarly
texts, because the round trip would require nearly 20,000 private use
characters (for instance, for a side-by-side representation of Japanese
and Chinese text in a Japanese textbook on "Chinese language for
advanced linguists").

Typical use for Unicode is for storage representation of locale
specific data, such that the actual encoding doesn't vary from locale
to locale.  In other words, it's a tool for localizing to a single
locale out of many possible locales, not for representing multiple
locales simultaneously active (the input issues, alone, for something
like that, would be prohibitive).  The closest you could come would be
a tool for a translator translating strings from one locale to another
for the purpose of moving software into the new locale -- even then,
you would probably implement it as cooperating applications, each in
its own locale, rather than as a single application.


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.