Date: Thu, 1 Mar 2001 05:41:22 +0000 (GMT)
From: Terry Lambert <tlambert@primenet.com>
To: jonathan@graehl.org (Jonathan Graehl)
Cc: freebsd-arch@FreeBSD.ORG (freebsd-Arch)
Subject: Re: Unicode, command line options, and configuration files, oh my!
Message-ID: <200103010541.WAA17385@usr05.primenet.com>
In-Reply-To: <NCBBLOALCKKINBNNEDDLAELNDLAA.jonathan@graehl.org> from "Jonathan Graehl" at Feb 28, 2001 01:48:49 PM
[ ... Unicode ... ]

UTF-8 encoded data is not fixed length in size. POSIX specifies
that file names can be up to 256 characters. 256 characters UTF-8
encoded can vary from 256 to 1280 bytes.

In general, this means that fixed two-byte Unicode data stored in
directory entries would require a directory entry block of 512b,
whereas for UTF-8, we are talking 2048b (2k). If the same approach
is used as the current UFS code uses, then these operations will
need to be directory entry block atomic.

FS stuff aside, most programs should use an internal encoding.

For FS storage, fixed data records are also a problem when using
UTF-8 encoding. The same goes for storing fixed-size input form
field data in databases, which like to have constraints set on
record sizes.

> There doesn't seem to be any impetus to systematically adopt
> Unicode (especially the fixed-two-bytes-per-char variant,
> which for most cases would simply double the storage/bandwidth
> requirement), although there are user-applications which
> operate on multibyte text.

UTF-8 is one byte per character for US ASCII, two bytes for the
high page (128 characters) of ISO 8859-1, and three or more bytes
for anything else.

The idea that storage requirements increase is U.S. centric; all
other character sets are penalized at least as much as if they
were directly encoded instead of multibyte encoded, and the vast
majority are penalized more.

On top of that, we have Microsoft and Java interoperability to
consider, distasteful as that may be to some.

There's an interesting list of Unicode resources available at:

http://www.unicode.org/unicode/onlinedat/products.html

					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
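
For readers who want to check the byte counts discussed in the message, here is a minimal Python sketch. It is not part of the original post. The helper names utf8_len and worst_case_bytes are made up for illustration, and the sample characters are arbitrary picks from each range. The factor of 5 bytes per character is an assumption inferred from the message's own arithmetic (1280 / 256); UTF-8 as later restricted by RFC 3629 uses at most 4 bytes per character, which would give 1024 instead.

```python
def utf8_len(text: str) -> int:
    """Return the number of bytes needed to store text as UTF-8."""
    return len(text.encode("utf-8"))


def worst_case_bytes(max_chars: int, bytes_per_char: int) -> int:
    """Upper bound on storage for max_chars characters (illustrative helper)."""
    return max_chars * bytes_per_char


# One arbitrary sample character from each range mentioned in the message.
samples = {
    "US-ASCII 'A' (U+0041)": "A",
    "ISO 8859-1 high page (U+00E9)": "\u00e9",
    "CJK ideograph (U+65E5)": "\u65e5",
}

for label, ch in samples.items():
    print(f"{label}: {utf8_len(ch)} byte(s)")

# Bounds for a 256-character file name.
print("all ASCII:", worst_case_bytes(256, 1))
# Assumption: 5 bytes per character, matching the message's 1280 figure.
print("message's upper bound:", worst_case_bytes(256, 5))
```

The fixed-record point in the message follows from these numbers: a field sized for 256 characters has to reserve the worst-case byte count, even though most names will use far less.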