From owner-freebsd-hackers  Tue Apr  4 19:18:23 2000
Delivered-To: freebsd-hackers@freebsd.org
Received: from phobos.illtel.denver.co.us (dsl-206.169.4.82.wenet.com [206.169.4.82])
	by hub.freebsd.org (Postfix) with ESMTP id D269537BC59
	for <freebsd-hackers@FreeBSD.ORG>; Tue,  4 Apr 2000 19:18:15 -0700 (PDT)
	(envelope-from abelits@phobos.illtel.denver.co.us)
Received: from localhost (abelits@localhost)
	by phobos.illtel.denver.co.us (8.9.3/8.9.3) with ESMTP id TAA11683;
	Tue, 4 Apr 2000 19:19:06 -0700
Date: Tue, 4 Apr 2000 19:19:06 -0700 (PDT)
From: Alex Belits <abelits@phobos.illtel.denver.co.us>
To: "G. Adam Stanislav" <adam@whizkidtech.net>
Cc: freebsd-hackers@FreeBSD.ORG
Subject: Re: Unicode on FreeBSD
In-Reply-To: <20000404201412.C261@whizkidtech.net>
Message-ID: <Pine.LNX.4.20.0004041827290.11214-100000@phobos.illtel.denver.co.us>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Tue, 4 Apr 2000, G. Adam Stanislav wrote:

> On Tue, Apr 04, 2000 at 05:05:05PM -0700, Alex Belits wrote:
> >  The existing "market" of multilingual application is so small, and it's
> >based on so simplistic requirements (to be able to display and print
> >characters, and make multilingual "web pages"), that even solution so much
> >flawed as standardization on Unicode can survive. Unicode is positioned as
> >the _replacement_ for languages/charsets handling infrastructure -- "we
> >know all the characters, so we can write all the words, right?".
> 
> Not so. Unicode is a character map. One of many. It just happens to be
> the most inclusive one in existence.

  It is. However, if you look at the current efforts toward its "adoption",
it is not used as one. It's touted as the solution to all language-related
problems, as a replacement for the language/charset labeling infrastructure,
and as a necessary prerequisite for any multilingual text processing.

[skipped]

> It does not, for example, provide sorting order. It cannot. Unicode is
> not about linguistics, it is about mapping characters regardless of their
> use in specific languages. And different languages sort characters
> differently. For example, in Slovak, "ch" is considered a character
> which belongs after the "h". In other languages it is sorted differently.
> And in most languages, it is just two unrelated characters.

  This is the kind of work that a language support infrastructure, which
currently does not exist, should do. When some language is encountered in a
"multilingual" document/protocol/..., its name can be used to load the
procedures (in this case sorting, but it may be hyphenation, phonetic
matching, etc.) for that particular language, and if no matching language is
known or supported, the data should just be left alone. The same
infrastructure can be designed to support charsets and encodings, converting
between them (and Unicode) only where possible and necessary, and providing
the text in the "original", "preferred", "supported", etc. encoding for the
language, whichever suits the particular operation to be performed on the
text. If such a thing were implemented, all the charset-specific routines
that now exist in various places could be reused, and compatibility with
existing software could be achieved without any significant pain.
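
  To make the idea concrete, here is a minimal sketch in Python of looking up
a procedure by language name. Everything in it is an assumption made for
illustration: the SORT_KEYS registry, the "sk" tag, and the slovak_key
function are hypothetical names, and the Slovak rule is reduced to the single
"ch after h" case from the example above (diacritics are ignored).

```python
# Hypothetical registry: language name -> function that builds a sort key.
# Only the "ch sorts after h" rule is modelled; real Slovak collation also
# orders accented letters, which this sketch ignores.

def slovak_key(word):
    """Build a sort key that treats the digraph "ch" as one letter after "h"."""
    key = []
    w = word.lower()
    i = 0
    while i < len(w):
        if w.startswith("ch", i):
            key.append(("h", 1))   # after plain "h" (rank 0), before "i"
            i += 2
        else:
            key.append((w[i], 0))
            i += 1
    return key


SORT_KEYS = {"sk": slovak_key}     # assumed language tag for Slovak


def sort_words(words, language=None):
    """Sort with the language's procedure; leave the data alone if unknown."""
    key = SORT_KEYS.get(language)
    if key is None:
        return list(words)         # no matching language: do not touch the data
    return sorted(words, key=key)


if __name__ == "__main__":
    words = ["chata", "hora", "cena", "ihla"]
    print(sort_words(words, "sk"))   # language-specific order
    print(sort_words(words, "xx"))   # unknown language, order unchanged
```

With the "sk" procedure loaded, "chata" sorts after "hora"; with an unknown
language name the list comes back in its original order.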

> Unicode is not simplistic. It does what its stated goal is, and it does
> it well. How we use it, is up to us.
> 
> Cheers,
> Adam
> 
> P.S. Hmmm... Interesting. I noticed my random quote contains a C-caron.
> I wonder how it is going to be handled. :)

  It was handled pretty well for such a primitive system as pine in an
xterm. Since your charset was ISO 8859-2, it was marked as such in the
Content-Type header of the message. pine gave me a warning:

---8<---
    [ The following text is in the "iso-8859-2" character set. ]
    [ Your display is set for the "koi8-r" character set.  ]
    [ Some characters may be displayed incorrectly. ]
--->8---

and displayed the text. xterm used the default font, which happened to be in
the koi8-r charset, so C-caron was displayed as Cyrillic ha. I read the
warning, manually switched xterm to a font in the ISO 8859-2 charset, and the
text was displayed correctly. If I had used a GUI-based MUA such as Netscape
(which I didn't, because Netscape Messenger sucks for reasons that have
nothing to do with its charset support), it would have just displayed the
message in the charset defined in the header.
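
  The mix-up can be reproduced in a few lines of Python, as a rough sketch
rather than a description of what pine or xterm do internally. The header
string and the declared_charset helper are made up for the illustration, and
I am assuming the character was a capital C-caron (byte 0xC8 in ISO 8859-2).

```python
from email import message_from_string


def declared_charset(headers, default="us-ascii"):
    """Return the charset named in the Content-Type header, or a default."""
    msg = message_from_string(headers)
    return msg.get_content_charset() or default


# Hypothetical header block, standing in for the original message.
headers = "Content-Type: text/plain; charset=iso-8859-2\n\n"

# Capital C-caron as it travels in an ISO 8859-2 message body: one byte.
raw = "\u010c".encode("iso-8859-2")

# What a display set for koi8-r shows for that same byte.
wrong = raw.decode("koi8-r")

# What you get when the charset label from the header is honoured.
right = raw.decode(declared_charset(headers))

if __name__ == "__main__":
    print(raw, wrong, right)
```

The byte never changes; only the charset used to interpret it does, which is
why the label in the header is enough to get the text displayed correctly.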

-- 
Alex

----------------------------------------------------------------------
 Excellent.. now give users the option to cut your hair you hippie!
                                                  -- Anonymous Coward

