Date:      Fri, 30 Mar 2001 12:35:05 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        rdm@cfcl.com (Rich Morin)
Cc:        freebsd-chat@FreeBSD.ORG
Subject:   Re: Unicode, 8-bit cleanliness, etc.
Message-ID:  <200103301235.FAA06280@usr05.primenet.com>
In-Reply-To: <p05001932b6e891d8ebed@[192.168.168.205]> from "Rich Morin" at Mar 28, 2001 11:19:18 PM

> I recently started playing with Mac OS X, which allows Unicode (UTF-8,
> AFAIK) in its path names.  Because I'm also using my trusty FreeBSD box,
> I'm wondering if there's any reason to worry about compatibility.  So,
> is FreeBSD totally 8-bit clean or are there some tarpits I should avoid?

FreeBSD is _not_ 8 bit clean.  Nor would being 8 bit clean be
sufficient, even if it were.

Computational representation of Unicode data is as wchar_t,
either 16 bit or 32 bit, instead of as signed char data.

Both wchar_t values are unsigned.  The 32 bit type does
nothing but waste space, since the people who whined about
it have failed to allocate anything beyond the default code
page in the 32 bit representation, which is all high bits 0,
and all low bits equal to the 16 bit values, which is to say
ISO-10646.  It seems that the complainers weren't coders.

FreeBSD, and the programs on it, frequently use "char" instead
of "unsigned char" to refer to character data.  They also do
pointer arithmetic, and other manipulation, which assume that
character values are 8 bit.  A common case is to attribute
character data by using "the next size up" (16 bit shorts) to
allow the data to be attributed.  This usage really requires
the definition of a larger type, which itself implies that a
32 bit wchar_t is unacceptable.
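
To make the signedness and "next size up" problems concrete, here
is a minimal sketch in Python (chosen only for brevity).  The
struct format codes "b" and "B" stand in for C's signed char and
unsigned char.  The helper names pack_cell and unpack_cell are
hypothetical, and the layout (8 bit attribute in the high byte of
a 16 bit cell) is an assumption for illustration, not any
particular program's format.

```python
import struct

raw = b"\xe9"  # one byte with the high bit set (e-acute in ISO 8859-1)

# The same byte read as a C "signed char" and as an "unsigned char".
as_signed = struct.unpack("b", raw)[0]
as_unsigned = struct.unpack("B", raw)[0]
print(as_signed, as_unsigned)   # -23 233


def pack_cell(attr, ch):
    """Hypothetical cell: attribute in the high byte, character in the low."""
    return ((attr & 0xFF) << 8) | (ch & 0xFF)   # truncates ch to 8 bits


def unpack_cell(cell):
    """Split a 16 bit cell back into (attribute, character)."""
    return (cell >> 8) & 0xFF, cell & 0xFF


# An 8 bit character survives the trip through the cell.
print(unpack_cell(pack_cell(0x01, 0xE9)))       # (1, 233)
# A 16 bit character (U+0416) loses its high byte.
print(unpack_cell(pack_cell(0x01, 0x0416)))     # (1, 22)
```

Code that indexes a table with the signed value, or that packs a
wide character into a cell sized for an 8 bit one, fails quietly
in just this way.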

Additionally, UTF-8 encoding is a problem.  This is because in
order to process the data for collation, etc. (even the simple
sort of output by "ls" or a file browser), it is _required_
to intern this data not as 8 bit clean character strings, but
as unencoded wchar_t arrays.  This is particularly problematic
for external representation, as well, for things like pipe,
tty, and other device data.  Think "cat a b > c", or worse, a
sed script, or think of the round trip requirement in your
mounting a legacy KOI-8 or ISO 8859-2 FS into a "Unicode"
system using UTF-8 (quoted for obvious reasons of pseudo-truth
of the label), or vice-versa.  You can not expect the legacy
system to perform the round-tripping of the data, which means
you have to put it in the kernel.
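
The round trip can be sketched in a few lines of Python, using its
standard "koi8_r" and "utf-8" codecs.  This only illustrates the
translation step that something (kernel or otherwise) has to
perform; the sample name and the variable names are assumptions
made up for the example.

```python
# "file" in Russian, written with escapes so the source stays ASCII.
name = "\u0444\u0430\u0439\u043b"

koi8 = name.encode("koi8_r")    # legacy on-disk form: 1 byte per character
utf8 = name.encode("utf-8")     # UTF-8 form: 2 bytes per Cyrillic character
print(len(koi8), len(utf8))     # 4 8

# Interning: decode to an array of code points before collating.
code_points = [ord(c) for c in koi8.decode("koi8_r")]

# The legacy bytes are not valid UTF-8, so they can not simply be
# passed through to a system that expects UTF-8 names.
try:
    koi8.decode("utf-8")
    legacy_is_valid_utf8 = True
except UnicodeDecodeError:
    legacy_is_valid_utf8 = False

# Round trip: UTF-8 -> code points -> legacy bytes.
round_tripped = utf8.decode("utf-8").encode("koi8_r")
print(round_tripped == koi8)    # True
```

Every consumer of the name (ls, sed, a pipe reader) needs the same
decode step before it can compare or sort characters rather than
bytes.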

Finally, path names are permitted by POSIX to be 255 total
characters.  UTF-8 encoded character strings for 255 16 bit
wchar_t characters vary from 255 to 765 8 bit characters (up
to 3 per character); this value goes to 1530 8 bit characters
for 32 bit wchar_t (up to 6 per character in the full 31 bit
UTF-8 encoding).  If a FreeBSD system (any system) is not
capable of supporting a file name of this length, using UTF-8
for path data renders these systems non-interoperable.
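
The growth in encoded length can be checked with a short sketch.
It assumes the byte-length rules of the original UTF-8 definition
(RFC 2279), which covers values up to 31 bits; utf8_len and
name_bytes are hypothetical helper names, and 255 is the name
length used in the paragraph above.

```python
# Upper bounds (exclusive) for values that fit in 1..6 UTF-8 bytes
# under the RFC 2279 rules.
LIMITS = (0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000)


def utf8_len(cp):
    """Bytes needed to encode one value of up to 31 bits."""
    for nbytes, limit in enumerate(LIMITS, start=1):
        if cp < limit:
            return nbytes
    raise ValueError("value does not fit in 31 bits")


def name_bytes(n_chars, cp):
    """Encoded size of a name made of n_chars copies of cp."""
    return n_chars * utf8_len(cp)


NAME_CHARS = 255
best = name_bytes(NAME_CHARS, 0x41)            # all ASCII
worst_16 = name_bytes(NAME_CHARS, 0xFFFF)      # largest 16 bit value
worst_31 = name_bytes(NAME_CHARS, 0x7FFFFFFF)  # largest 31 bit value
print(best, worst_16, worst_31)
```

A fixed 255 byte name buffer therefore holds anywhere from 255
characters down to a fraction of that, depending on the script.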

It is much, much cleaner to go to 16 bit wchar_t for both
internal and external representation, and deal with legacy
issues with OS translation, rather than trying to jam in legacy
compatibility by putting encoding and decoding, along with
the externalization exceptions, into each and every program.

All in all, I guess this says "FreeBSD doesn't have this
worked out, but neither does Mac OS X, and Windows barely
has it worked out, and is still fighting the legacy program
issue".  Windows, by the way, handles compatibility by having
an "old 8 bit" and "new 16 bit" namespace, and doing immediate
binding (not late binding) of names between the namespaces;
in other words, they bit the backward compatibility bullet,
and are eating the legacy application conversion on a program
by program basis, as a problem for the program vendors to
resolve.

This type of thing becomes significantly easier, if you list
out all the legacy issues, and decide on a standard strategy
for how you are going to handle them.

PS: The above totally ignores the tools problem of how you
would go about representing statically initialized Unicode
character data in programs.  In particular, the XPG/4
solution for this was the use of trigraphs; this was very
much discouraged, with the stated preference being for the
use of message catalogs for storing such strings.

PPS: Needless to say, this complicates "hello world"; the way
Microsoft dealt with this problem (user programs with names
that vary only by directory in which the executables exist)
vs. the way X/Open (the source of XPG/3 and XPG/4) dealt with
this problem (flat catalog namespace) are also very telling
about thinking out the commercial implications of the problem.
The Sun approach ([...]/com/sunsoft/machine/program/) also
assumes that there will be no local developers on the machine,
since catalog installation by vendor requires root access and
can not be performed by ordinary users; this is also very telling.

					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-chat" in the body of the message



