Date:      Sat, 8 May 1999 01:17:17 -0500
From:      "G. Adam Stanislav" <adam@whizkidtech.net>
To:        Thomas David Rivers <rivers@dignus.com>
Cc:        dima@tejblum.dnttm.rssi.ru, freebsd-hackers@FreeBSD.ORG
Subject:   Re: wc* routines
Message-ID:  <19990508011716.A218@whizkidtech.net>
In-Reply-To: <199905080401.AAA01012@lakes.dignus.com>; from Thomas David Rivers on Sat, May 08, 1999 at 12:01:55AM -0400
References:  <19990507201101.A216@whizkidtech.net> <199905080401.AAA01012@lakes.dignus.com>

On Sat, May 08, 1999 at 12:01:55AM -0400, Thomas David Rivers wrote:
>  Actually - it's up to the program developers - they can deliver their
>  program linked either statically, or dynamically.

Mhm, you had me go to man gcc, and sure enough, there is a -static option I was
not aware of before.

> [snip]
>  As far as the potential load time improvements - well, the maintenance
>  burden just got too big to support it.
> [snip]
>  You can use shared memory... but, honestly, I don't think it will be
>  a big issue...   I mean, it's 3000 characters, give or take, which is
>  6000 bytes (maybe 12000 if you have a 4-byte character code) - this
>  really isn't enough to bother with.

No, no, I wish it were that simple! The 3000 are just the subset I am
considering for hard coding in case the file is not there (which also gives
the user the choice of deliberately not having the file if they do not want
or need all of Unicode).

But currently there are close to 7000 characters in Unicode (I just checked),
not counting "han" which is the combined character set for Chinese, Japanese,
and Korean (my Chinese dictionary alone has 7773 characters in it).

However, there are a number of proposals under consideration for the addition
of other scripts. With ISO 10646 extending Unicode to 31 bits, we can,
theoretically, end up with 2 billion characters. Now, we are far from there,
but ISO divides the space into planes. A plane is selected by the upper 16
(or actually 15) bits of the code, so each plane can hold 64K characters.
None is full, and right now only plane 0 is in use, but many of the proposals
are already for plane 1.
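
As a minimal sketch of the split described above, assuming a 31-bit code held
in a uint32_t: the plane is the upper 15 bits and the position within the
plane is the lower 16. The names wc_plane and wc_offset are hypothetical,
chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: which plane a 31-bit ISO 10646 code belongs to. */
uint32_t wc_plane(uint32_t code)
{
    assert(code <= 0x7FFFFFFFu);  /* codes are assumed to fit in 31 bits */
    return code >> 16;            /* upper 15 bits select the plane */
}

/* Hypothetical helper: position of the code within its 64K plane. */
uint32_t wc_offset(uint32_t code)
{
    return code & 0xFFFFu;        /* lower 16 bits */
}
```

Under these assumptions a code such as 0x10400 lands in plane 1 at offset
0x0400, which is why the codes cannot be treated as one dense range.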

That means that even though we only have thousands of characters, their codes
are not consecutive. Hence the simplest table would consist of at least 5
bytes per character: 4 bytes to tell us which character we are talking about,
1 byte to tell us about the properties of the character.
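
The 5-byte entry could be sketched roughly as follows, assuming the table is
kept sorted by code so it can be binary searched. The struct, the property
bits, and wc_props are all hypothetical names; note that a C compiler will
usually pad the in-memory struct to 8 bytes, so the 5-byte figure applies to
a packed on-disk layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical property bits; the names are illustrative only. */
enum { WP_UPPER = 1 << 0, WP_LOWER = 1 << 1, WP_DIGIT = 1 << 2 };

struct wc_entry {
    uint32_t code;   /* 4 bytes: which character */
    uint8_t  props;  /* 1 byte: its properties */
};

/* Tiny sample table, sorted by code (real data would come from a file). */
const struct wc_entry wc_sample[] = {
    { 0x0041, WP_UPPER },  /* LATIN CAPITAL LETTER A */
    { 0x0061, WP_LOWER },  /* LATIN SMALL LETTER A */
    { 0x0391, WP_UPPER },  /* GREEK CAPITAL LETTER ALPHA */
    { 0x03B1, WP_LOWER },  /* GREEK SMALL LETTER ALPHA */
};

/* Binary search; returns 0 (no properties) for codes not in the table. */
uint8_t wc_props(const struct wc_entry *tab, size_t n, uint32_t code)
{
    size_t lo = 0, hi = n;

    assert(tab != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tab[mid].code < code)
            lo = mid + 1;
        else if (tab[mid].code > code)
            hi = mid;
        else
            return tab[mid].props;
    }
    return 0;
}
```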

But there is more... Take conversion functions, such as towupper and towlower.
There is no system to it. Unlike in ASCII, you cannot just turn a bit on or off.
You need a map.

So, for example, to map to lower case, you need 4 bytes to list which
code we are dealing with, plus its lower case version. Now,
the lower case version will always be on the same plane, so you do not need
4 bytes for that. Indeed, it seems you could just use one byte (I need to
do further analysis to make sure that the upper and lower case characters
always share the 3 upper bytes).
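
A sketch of such a one-byte mapping, purely under the assumption stated above
that both cases share their 3 upper bytes. That assumption is exactly what
still needs checking: a pair such as U+00FF and U+0178 appears to cross the
boundary and would need a wider target field. The names wc_casemap and
wc_map_case are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct wc_casemap {
    uint32_t code;  /* 4 bytes: the source character */
    uint8_t  low;   /* 1 byte: low byte of the converted character */
};

/* Sample "tolower" map, sorted by code. */
const struct wc_casemap wc_sample_tolower[] = {
    { 0x0041, 0x61 },  /* A -> a */
    { 0x0042, 0x62 },  /* B -> b */
    { 0x0391, 0xB1 },  /* GREEK CAPITAL ALPHA -> U+03B1 */
    { 0x0410, 0x30 },  /* CYRILLIC CAPITAL A  -> U+0430 */
};

/* Returns the mapped code, or the code unchanged if it has no entry. */
uint32_t wc_map_case(const struct wc_casemap *tab, size_t n, uint32_t code)
{
    size_t lo = 0, hi = n;

    assert(tab != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tab[mid].code < code)
            lo = mid + 1;
        else if (tab[mid].code > code)
            hi = mid;
        else  /* keep the 3 upper bytes, replace only the low byte */
            return (code & ~(uint32_t)0xFF) | tab[mid].low;
    }
    return code;
}
```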

Then you need a separate map to go the other way. Unicode even defines the
title version of a character. Luckily ANSI C has no towtitle! But we do need
yet another table for towdigit.

Again, I need to do some further analysis to see how many characters can have
a lower case or upper case conversion, how many represent a digit, an xdigit,
etc. I am pretty sure that having separate maps for each function will take
up less space than having one global map. Most of the characters do not have
upper/lower case, as that concept pretty much exists only in the Roman, Greek,
and Cyrillic alphabets. There are no cases in Indic alphabets, Hebrew, Arabic,
etc. And in each alphabet only a few characters are digits.

As for xdigits, that is a recent mathematical concept. As far as I know, a
hexadecimal number is by definition taken from the Roman alphabet. Right now,
my towxdigit(c) looks like: { return toxdigit(c); }.

There is another good reason for separate maps for each function: ANSI C
reserves the right to add more conversion functions in the future. It is
much less painful to add a new map for each new function than to redo
the structure of a single map every time a new function is added.

So, anyway, we are definitely talking about more than 12000 bytes.

That said, it is not as bad as it seems. Some of the maps can simply list
ranges of codes rather than the codes themselves, 8 bytes per range, i.e.,
from-to, from-to, from-to...
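
A range map of this kind might look like the sketch below: two 4-byte codes
per range, sorted and non-overlapping, searched by bisection. The struct name,
the function name, and the sample digit ranges are illustrative assumptions,
not the layout actually used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct wc_range {
    uint32_t from;  /* first code in the range (inclusive) */
    uint32_t to;    /* last code in the range (inclusive) */
};

/* Sample "digit" map: three from-to pairs, 8 bytes each. */
const struct wc_range wc_sample_digits[] = {
    { 0x0030, 0x0039 },  /* ASCII digits */
    { 0x0660, 0x0669 },  /* Arabic-Indic digits */
    { 0x0966, 0x096F },  /* Devanagari digits */
};

/* Returns 1 if code falls inside any range, 0 otherwise. */
int wc_in_ranges(const struct wc_range *r, size_t n, uint32_t code)
{
    size_t lo = 0, hi = n;

    assert(r != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (code < r[mid].from)
            hi = mid;
        else if (code > r[mid].to)
            lo = mid + 1;
        else
            return 1;
    }
    return 0;
}
```

Here thirty digits cost 24 bytes, against 150 bytes at 5 bytes per character.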

Others need to list every lower case/upper case combination, or digit/value
combination, and such, but there are only so many characters in each.

>  However, I'd add that:
> 	1)  I don't believe the complications that brings are
>             worth the savings
> 
> 	2)  It will keep me from using the library in other (non-UNIX)
> 	    contexts, which would be disappointing.
> 
> 	3)  You are not guaranteed that a FreeBSD kernel has been
> 	    compiled with shared memory enabled, so you'd need to
> 	    "fall back" to allocating them separately anyway.
> 
> 
> Your goal of minimizing space is laudable... perhaps the file could
> be broken up so that each program only read the code it needed, instead
> of everyone reading the entire file?

Yes, that seems like a good idea. Perhaps putting each map into a separate
file. Not every program that uses towlower will use towdigit. And it is
easier to extend. Could, for example, add functions to convert between
katakana and hiragana without affecting programs that do not even know
what that is.

> As far as "finding the file" - I suggest you look at how the locale
> library does this (in fact, I suggest you simply "steal" much of this
> from there) - basically, define a location in /usr/include/paths.h.
> Look there for the file...  if it's not there - well, you didn't
> find it.

Ah, nice. I did not know about paths.h. Interesting, though. Makes me wonder
where I should put the files. I have been thinking of /usr/share/wctypes/
(or possibly /usr/share/wcmaps/).

The file names can simply be the same as the type names, as those are
defined by ANSI C, e.g. lower, upper, digit, xdigit, etc. As for the
conversion maps, they could be named toupper, tolower, todigit, etc.

I'm thinking out loud here. It is always helpful to me to write my ideas down
in a message. :-) I have already done it somewhat differently, but I have no
qualms about tossing what I have done and doing it over.

I just came up with an idea while typing this: There is a function called wctype
which converts a text string, such as "alpha" into the corresponding wctype_t
value which is simply a number. I can actually add new types in the future
without having to rewrite the code! Each of the files (lower, upper...)
can contain its wctype_t inside. So, if wctype receives a string it does not
recognize, it can just check if a file of that name exists, and if so, it
can read its wctype_t from the file, and associate that number with that
name, so the rest of the library knows about it.
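
The lookup idea above could be sketched like this. Everything here is an
assumption made for illustration: the stand-in type my_wctype_t, the function
my_wctype, the built-in numbers, and a file format in which the type value is
stored as decimal text at the start of the file. Remembering the name/number
pair for the rest of the library is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef unsigned long my_wctype_t;  /* stand-in for the real wctype_t */

/* Names the library already knows; the numbers are illustrative. */
static const struct {
    const char *name;
    my_wctype_t id;
} builtin_types[] = {
    { "alpha", 1 }, { "digit", 2 }, { "lower", 3 }, { "upper", 4 },
};

/* Returns the type number for name, or 0 if it is unknown. */
my_wctype_t my_wctype(const char *dir, const char *name)
{
    char path[512];
    unsigned long id = 0;
    FILE *fp;
    size_t i;
    int len;

    assert(dir != NULL && name != NULL);

    /* First try the names compiled into the library. */
    for (i = 0; i < sizeof builtin_types / sizeof builtin_types[0]; i++)
        if (strcmp(builtin_types[i].name, name) == 0)
            return builtin_types[i].id;

    /* Unknown name: look for a file of that name in the map directory. */
    if (strchr(name, '/') != NULL)
        return 0;                       /* do not leave the directory */
    len = snprintf(path, sizeof path, "%s/%s", dir, name);
    if (len < 0 || (size_t)len >= sizeof path)
        return 0;                       /* path would not fit */
    fp = fopen(path, "r");
    if (fp == NULL)
        return 0;                       /* no such file, no such type */
    if (fscanf(fp, "%lu", &id) != 1)
        id = 0;                         /* file does not start with a number */
    fclose(fp);
    return id;
}
```

A call such as my_wctype("/usr/share/wctypes", "katakana") would then succeed
only once a file named katakana has been installed in that directory.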

Hey, gee, thanks for the inspiration. :-)

Adam





