Date:      Thu, 1 Mar 2001 05:41:22 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        jonathan@graehl.org (Jonathan Graehl)
Cc:        freebsd-arch@FreeBSD.ORG (freebsd-Arch)
Subject:   Re: Unicode, command line options, and configuration files, oh my!
Message-ID:  <200103010541.WAA17385@usr05.primenet.com>
In-Reply-To: <NCBBLOALCKKINBNNEDDLAELNDLAA.jonathan@graehl.org> from "Jonathan Graehl" at Feb 28, 2001 01:48:49 PM

[ ... Unicode ... ]

UTF-8 encoded data is not fixed length.

POSIX specifies that file names can be up to 256 characters.

256 characters, UTF-8 encoded, can occupy anywhere from 256 to
1280 bytes.

In general, this means that storing fixed-width (two bytes per
character) Unicode data in directory entries would require a
directory entry block of 512 bytes, whereas for UTF-8 we are
talking about 2048 bytes (2k).
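
The arithmetic behind those figures can be sketched roughly as follows.
This is only an illustration: the 5 bytes per character worst case is
inferred from 1280 / 256, and the rounding up to a power of two is an
assumption about how 1280 becomes 2k. Per-entry header overhead is
ignored.

```python
# Rough worst-case sizing for a 256-character name.
# BYTES_FIXED and BYTES_UTF8_WORST are assumptions inferred from the
# figures in the text (512 / 256 and 1280 / 256), not from any spec.
MAX_CHARS = 256
BYTES_FIXED = 2        # fixed two-bytes-per-character encoding
BYTES_UTF8_WORST = 5   # worst case implied by the 1280 figure


def worst_case_bytes(max_chars, bytes_per_char):
    """Bytes needed to hold max_chars characters in the worst case."""
    return max_chars * bytes_per_char


def round_up_pow2(n):
    """Smallest power of two that is >= n."""
    size = 1
    while size < n:
        size *= 2
    return size


fixed_block = round_up_pow2(worst_case_bytes(MAX_CHARS, BYTES_FIXED))
utf8_block = round_up_pow2(worst_case_bytes(MAX_CHARS, BYTES_UTF8_WORST))
print(fixed_block, utf8_block)
```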

If the same approach is used as the current UFS code uses,
then these operations will need to be directory entry block
atomic.

FS stuff aside, most programs should use internal encoding.

For FS storage, fixed-size data records are also a problem when
using UTF-8 encoding.  The same goes for storing fixed-size
input form field data in databases, which like to have
constraints set on record sizes.
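
To make the fixed-record problem concrete, here is a minimal sketch of
what an application has to do to fit UTF-8 text into a fixed-size byte
field without cutting a character in half. The function name and the
NUL padding are hypothetical choices for illustration, not anything
UFS or a particular database does.

```python
def fit_utf8_field(text, field_bytes):
    """Encode text as UTF-8 into exactly field_bytes bytes.

    Hypothetical helper: truncates on a character boundary and pads
    with NUL bytes. The number of characters that fit depends on the
    text, which is the problem with fixed-size records.
    """
    raw = text.encode("utf-8")[:field_bytes]
    # The byte slice may end inside a multibyte sequence; decoding with
    # errors="ignore" drops that incomplete trailing sequence.
    whole = raw.decode("utf-8", errors="ignore").encode("utf-8")
    return whole.ljust(field_bytes, b"\x00")


# An 8-byte field holds eight ASCII characters but only two of these.
ascii_field = fit_utf8_field("abcdefghij", 8)
kanji_field = fit_utf8_field("\u65e5\u672c\u8a9e", 8)
```

With a fixed-width encoding the field capacity in characters is a
constant; with UTF-8 it is only known once the data is seen.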


> There doesn't seem to be any impetus to systematically adopt
> Unicode (especially the fixed-two-bytes-per-char variant,
> which for most cases would simply double the storage/bandwidth
> requirement), although there are user-applications which
> operate on multibyte text.

UTF-8 is one byte per character for US ASCII, two bytes for
the high page (128 characters) of ISO 8859-1 and other code
points up to U+07FF, and three or more bytes for anything else.
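
A quick way to check the per-character sizes, as a sketch using
Python's built-in codec; the sample characters are arbitrary picks:

```python
def utf8_len(ch):
    """Number of bytes the UTF-8 encoding of ch occupies."""
    return len(ch.encode("utf-8"))


samples = [
    ("A", "US ASCII"),
    ("\u00e9", "ISO 8859-1 high page (e with acute)"),
    ("\u20ac", "euro sign"),
    ("\u65e5", "CJK ideograph"),
]

for ch, label in samples:
    print("U+%04X %-36s %d byte(s)" % (ord(ch), label, utf8_len(ch)))
```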

The idea that storage requirements increase is U.S. centric;
under UTF-8, all other character sets are penalized at least as
much as they would be if directly encoded instead of multibyte
encoded, and the vast majority are penalized more.
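
The comparison can be sketched with a few sample strings. Here
UTF-16LE (no byte order mark) stands in for the fixed
two-bytes-per-character encoding, which is an assumption that holds
only for characters in the 16-bit range; the sample strings are
arbitrary.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, fixed16_bytes) for text.

    UTF-16LE is used as a stand-in for a fixed two-byte encoding; that
    is only accurate for characters that fit in 16 bits.
    """
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for sample in ("abc", "\u00e9\u00e8\u00e0", "\u65e5\u672c\u8a9e"):
    utf8_bytes, fixed_bytes = encoded_sizes(sample)
    print("%d chars: UTF-8 %d bytes, fixed 16-bit %d bytes"
          % (len(sample), utf8_bytes, fixed_bytes))
```

Only the ASCII sample comes out smaller in UTF-8; the accented Latin
sample breaks even and the Japanese sample is half again as large.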

On top of that, we have Microsoft and Java interoperability to
consider, distasteful as that may be to some.

There's an interesting list of Unicode resources available at:
http://www.unicode.org/unicode/onlinedat/products.html


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
