Date: Fri, 30 Mar 2001 12:35:05 +0000 (GMT)
From: Terry Lambert <tlambert@primenet.com>
To: rdm@cfcl.com (Rich Morin)
Cc: freebsd-chat@FreeBSD.ORG
Subject: Re: Unicode, 8-bit cleanliness, etc.
Message-ID: <200103301235.FAA06280@usr05.primenet.com>
In-Reply-To: <p05001932b6e891d8ebed@[192.168.168.205]> from "Rich Morin" at Mar 28, 2001 11:19:18 PM
> I recently started playing with Mac OS X, which allows Unicode (UTF-8,
> AFAIK) in its path names. Because I'm also using my trusty FreeBSD box,
> I'm wondering if there's any reason to worry about compatibility. So,
> is FreeBSD totally 8-bit clean or are there some tarpits I should avoid?

FreeBSD is _not_ 8 bit clean. Nor would being 8 bit clean be sufficient, if it were.

Computational representation of Unicode data is either as 16 bit wchar_t instead of signed char data, or as 32 bit wchar_t. Both wchar_t values are unsigned. The 32 bit type does nothing but waste space, since the people who whined about it have failed to allocate anything beyond the default code page in the 32 bit representation, which is all high bits 0, and all low bits equal to the 16 bit values, which is to say ISO-10646. It seems that the complainers weren't coders.

FreeBSD, and the programs on it, frequently use "char" instead of "unsigned char" to refer to character data. They also do pointer arithmetic, and other manipulation, which assumes that character values are 8 bit.

A common case is to attach attributes to character data by storing it in "the next size up" (16 bit shorts), leaving the extra bits for the attributes. This usage really requires the definition of a larger type, which itself implies that a 32 bit wchar_t is unacceptable.

Additionally, UTF-8 encoding is a problem. This is because in order to process the data for collation, etc. (even the simple sort of output by "ls" or a file browser), it is _required_ to hold this data internally not as 8 bit clean character strings, but as unencoded wchar_t arrays.

This is particularly problematic for external representation as well, for things like pipe, tty, and other device data. Think "cat a b > c", or worse, a sed script, or think of the round trip requirement when you mount a legacy KOI-8 or ISO 8859-2 FS into a "Unicode" system using UTF-8 (quoted for obvious reasons of pseudo-truth of the label), or vice-versa.
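To make the round trip requirement concrete, here is a minimal sketch in Python (chosen only for brevity). The file name and both helper functions are hypothetical, and the sketch assumes the translating layer already knows which legacy encoding the file system uses, since nothing in the bytes themselves says so.

```python
# Hypothetical legacy file name (Cyrillic for "file" plus ".txt"),
# written with escapes so the source itself stays plain ASCII.
NAME_TEXT = "\u0444\u0430\u0439\u043b.txt"

# What a KOI8-R file system would store on disk: one byte per character.
legacy_name = NAME_TEXT.encode("koi8_r")


def legacy_to_utf8(raw, legacy_encoding):
    """Translate a legacy-encoded name to UTF-8 (hypothetical helper)."""
    return raw.decode(legacy_encoding).encode("utf-8")


def utf8_to_legacy(raw, legacy_encoding):
    """Translate a UTF-8 name back to the legacy encoding (hypothetical helper)."""
    return raw.decode("utf-8").encode(legacy_encoding)


utf8_name = legacy_to_utf8(legacy_name, "koi8_r")
round_tripped = utf8_to_legacy(utf8_name, "koi8_r")

# The same bytes read under the wrong assumption (ISO 8859-2) decode
# without error, but to different text, so a wrong guess goes unnoticed.
misread = legacy_name.decode("iso8859_2")

print(len(legacy_name), len(utf8_name))   # byte lengths before and after
print(round_tripped == legacy_name)       # did the round trip preserve the name?
print(misread == NAME_TEXT)               # did the wrong guess give the same text?
```

The translation only round-trips when both directions agree on the legacy encoding, and the encoded name grows in the process, which is why the question of where that knowledge lives matters.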
You can not expect the legacy system to perform the round-tripping of the data, which means you have to put it in the kernel.

Finally, path names are permitted by POSIX to be 255 total characters. UTF-8 encoded character strings for 255 16 bit wchar_t characters vary from 255 to 1275 8 bit characters; this value goes to 2550 8 bit characters for 32 bit wchar_t. A FreeBSD system (or any system) that uses UTF-8 for path data but cannot support file names of this length is not interoperable with systems that can.

It is much, much cleaner to go to 16 bit wchar_t for both internal and external representation, and deal with legacy issues with OS translation, rather than trying to jam in legacy compatibility by putting encoding and decoding, along with the externalization exceptions, into each and every program.

All in all, I guess this says "FreeBSD doesn't have this worked out, but neither does Mac OS X, and Windows barely has it worked out, and is still fighting the legacy program issue".

Windows, by the way, handles compatibility by having an "old 8 bit" and a "new 16 bit" namespace, and doing immediate binding (not late binding) of names between the namespaces; in other words, they bit the backward compatibility bullet, and are eating the legacy application conversion on a program by program basis, as a problem for the program vendors to resolve.

This type of thing becomes significantly easier if you list out all the legacy issues, and decide on a standard strategy for how you are going to handle them.

PS: The above totally ignores the tools problem of how you would go about representing statically initialized Unicode character data in programs. In particular, the XPG/4 solution for this was the use of trigraphs; this was very much discouraged, with the stated preference being for the use of message catalogs for storing such strings.
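For anyone who wants to check the path-length arithmetic above, the following Python sketch computes the encoded size range of a 255-character name. It assumes UTF-8 as currently standardized in RFC 3629 (at most 3 bytes for a 16 bit code point, at most 4 for anything up to U+10FFFF); the function name is hypothetical. Under that assumption the upper bounds come out lower than the figures quoted in the message, but the interoperability point is the same: a limit counted in bytes holds far fewer than 255 non-ASCII characters.

```python
def utf8_length_range(n_chars, max_code_point):
    """Return (min_bytes, max_bytes) for n_chars code points <= max_code_point.

    Assumes RFC 3629 UTF-8, where encoded length never decreases as the
    code point grows, so the largest code point gives the worst case.
    """
    shortest = len("A".encode("utf-8"))                  # ASCII: 1 byte
    longest = len(chr(max_code_point).encode("utf-8"))   # worst case
    return n_chars * shortest, n_chars * longest


NAME_CHARS = 255  # the name length discussed in the message

bmp_range = utf8_length_range(NAME_CHARS, 0xFFFF)       # 16 bit code points
full_range = utf8_length_range(NAME_CHARS, 0x10FFFF)    # full Unicode range

# Worst case in the other direction: how many 3-byte characters fit
# if the limit is 255 bytes rather than 255 characters.
chars_in_255_bytes = 255 // 3

print(bmp_range, full_range, chars_in_255_bytes)
```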
PPS: Needless to say, this complicates "hello world"; the way Microsoft dealt with this problem (user programs with names that vary only by the directory in which the executables exist) vs. the way X/Open (the source of XPG/3 and XPG/4) dealt with this problem (flat catalog namespace) are also very telling about thinking out the commercial implications of the problem. The Sun approach ([...]/com/sunsoft/machine/program/) also assumes that there will be no local developers on the machine, since catalog installation by vendor requires root access and can not be performed by ordinary users; that is also very telling.

					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-chat" in the body of the message