Date:      Thu, 1 Mar 2001 00:02:07 -0600
From:      "Michael C . Wu" <keichii@iteration.net>
To:        Terry Lambert <tlambert@primenet.com>
Cc:        Jonathan Graehl <jonathan@graehl.org>, freebsd-Arch <freebsd-arch@FreeBSD.ORG>, i18n@freebsd.org
Subject:   Re: Unicode, command line options, and configuration files, oh my!
Message-ID:  <20010301000207.C4359@peorth.iteration.net>
In-Reply-To: <200103010541.WAA17385@usr05.primenet.com>; from tlambert@primenet.com on Thu, Mar 01, 2001 at 05:41:22AM +0000
References:  <NCBBLOALCKKINBNNEDDLAELNDLAA.jonathan@graehl.org> <200103010541.WAA17385@usr05.primenet.com>

Use -i18n please. :)

On Thu, Mar 01, 2001 at 05:41:22AM +0000, Terry Lambert scribbled:
| [ ... Unicode ... ]
| 
| UTF encoded data is not fixed length in size.
| 
| POSIX specifies that file names can be up to 256 characters.
| 
| 256 characters UTF-8 encoded can vary from 256 to 1280
| bytes.
|
| In general, this means that Unicode data stored in
| directory entries would require a directory entry
| block of 512 bytes, whereas for UTF-8, we are
| talking about 2048 bytes (2k).
| 
| If the same approach is used as the current UFS code uses,
| then these operations will need to be directory entry block
| atomic.
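
The size range quoted above can be checked with a little arithmetic.
The following is a minimal sketch, not anything taken from UFS: the
256-character limit is simply the figure from the quoted text, and
utf8_size is a hypothetical helper. It relies on Python's UTF-8
encoder, which follows the current definition of UTF-8 (at most 4
bytes per character), whereas the 1280 figure appears to assume 5
bytes per character.

```python
NAME_CHARS = 256  # limit taken from the quoted text, not looked up in POSIX


def utf8_size(name: str) -> int:
    """Number of bytes needed to store `name` in UTF-8."""
    return len(name.encode("utf-8"))


shortest = "a" * NAME_CHARS            # US-ASCII: 1 byte per character
bmp_worst = "\u4e2d" * NAME_CHARS      # CJK ideograph: 3 bytes per character
longest = "\U00010348" * NAME_CHARS    # outside the BMP: 4 bytes per character
fixed_two_byte = 2 * NAME_CHARS        # fixed two-byte encoding: always 512

print(utf8_size(shortest), utf8_size(bmp_worst), utf8_size(longest), fixed_two_byte)
```

Under these assumptions a fixed-size directory entry would have to
reserve the worst case for every name, which is the cost being argued
about here.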

In short, we can save the file name that the user sees
with the file data.  The filesystem and the kernel see
some other naming scheme determined by the FS/kernel.

| FS stuff aside, most programs should use internal encoding.
| 
| For FS storage, fixed data records are also a problem when
| using UTF-8 encoding.  The same goes for the ability to
| store fixed-size input form field data in databases, which
| like to have constraints set on record sizes.
| 
| 
| > There doesn't seem to be any impetus to systematically adopt
| > Unicode (especially the fixed-two-bytes-per-char variant,
| > which for most cases would simply double the storage/bandwidth
| > requirement), although there are user-applications which
| > operate on multibyte text.
| 
| UTF-8 is one byte per character for US ASCII, two bytes for
| the high page (128 characters) of ISO 8859-1, and three or more
| bytes for anything else.

Bad design. period.
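
For reference, the byte counts quoted above follow directly from the
code point ranges. This is a sketch only: utf8_len is a hypothetical
helper written from the ranges in the current UTF-8 definition (RFC
3629), and it does not reject the surrogate range, which cannot be
encoded. Note that the two-byte range covers everything up to U+07FF
(Greek and Cyrillic included), not only the high half of ISO 8859-1.

```python
def utf8_len(code_point: int) -> int:
    """Bytes UTF-8 needs for one code point (RFC 3629 ranges)."""
    if code_point < 0x80:       # US-ASCII
        return 1
    if code_point < 0x800:      # ISO 8859-1 high half, Greek, Cyrillic, ...
        return 2
    if code_point < 0x10000:    # rest of the Basic Multilingual Plane, e.g. CJK
        return 3
    return 4                    # supplementary planes, up to U+10FFFF


samples = {"A": 0x41, "e-acute": 0xE9, "Cyrillic Ya": 0x42F, "CJK zhong": 0x4E2D}
for label, cp in samples.items():
    print(label, utf8_len(cp))
```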

| The idea that storage requirements increase is U.S.-centric;
| all other character sets are penalized at least as much as if
| they were directly encoded instead of multibyte encoded, and
| the vast majority are penalized more.

Yup, bad design. :)
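
The storage argument can be made concrete by encoding the same short
strings both ways. A rough sketch, with assumptions: the sample words
are arbitrary, sizes is a hypothetical helper, and UTF-16 without a
byte order mark stands in for the fixed-two-bytes-per-char variant
(the two agree for text inside the Basic Multilingual Plane).

```python
def sizes(text):
    """Return (UTF-8 bytes, UTF-16 bytes) for `text`, without a byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


samples = {
    "English": "hello",
    "Russian": "\u043f\u0440\u0438\u0432\u0435\u0442",   # 6 Cyrillic letters
    "Japanese": "\u65e5\u672c\u8a9e",                    # 3 CJK characters
}
for language, text in samples.items():
    print(language, *sizes(text))
```

For these samples English doubles in size under the two-byte
encoding, Russian comes out the same either way, and Japanese is half
again as large in UTF-8.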

| On top of that, we have Microsoft and Java interoperability to
| consider, distasteful as that may be to some.

M$ has a pretty good implementation here.
Java I18N sucks really bad.

| There's an interesting list of Unicode resources available at:
| http://www.unicode.org/unicode/onlinedat/products.html

-- 
+------------------------------------------------------------------+
| keichii@peorth.iteration.net         | keichii@bsdconspiracy.net |
| http://peorth.iteration.net/~keichii | Yes, BSD is a conspiracy. |
+------------------------------------------------------------------+

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-i18n" in the body of the message



