Date:      Thu, 11 Jun 1998 21:51:48 -0700 (PDT)
From:      Gary Kline <kline@tao.thought.org>
To:        tlambert@primenet.com (Terry Lambert)
Cc:        joy@urc.ac.ru, itojun@itojun.org, tlambert@primenet.com, hackers@FreeBSD.ORG
Subject:   Re: internationalization
Message-ID:  <199806120451.VAA13595@tao.thought.org>
In-Reply-To: <199806120119.SAA06619@usr09.primenet.com> from Terry Lambert at "Jun 12, 98 01:19:28 am"

According to Terry Lambert:
> > 		Let me pose the same question, a bit more broadly.
> > 		Why cannot we support _both_ the ISO and Unicode
> > 		paradigms?  Are these absolutely incompatible systems?
> > 		Is there some kind of ``religious-war''?   Or is it
> > 		simply too difficult?
> 
> ISO 10646 code page 0 *is* Unicode, by definition.
> 
> The religious aspects have to do with the old trade-offs the various
> programmers are already used to, the new trade-offs the various
> programmers would have to start putting up with, and the various
> language bigotries people bring to the table.


	I'm approaching this with relatively little bigotry or
	other baggage; my bias is against bias itself.  That said,
	I've been around enough decades to realize that virtually
	everyone carries latent bigotries of some ilk.  I'd just
	rather stay above as much of it as possible here.

	So far this discussion looks promising; my thanks to
	everyone.


> 
> 
> Major premise: everyone is going to have to put up with a non-8-bit
> wchar_t internally in their applications.  This is called the "raw"
> or "process" representation.


	This I not only believe, but agree with.  Memory is cheap;
	disk is cheap; so making the character type a wchar_t (either
	16 or 32 bits) is no major obstacle.
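
A rough way to put numbers on that cost is to encode the same string at
8, 16 and 32 bits per character.  This is only a sketch: Python's
"utf-16-le" and "utf-32-le" codecs stand in for a 16- and 32-bit wchar_t
array (an assumption, since the in-memory layout of wchar_t is up to the
platform), and the sample string is arbitrary.

```python
# Sketch: storage cost of the same text at 8, 16 and 32 bits per character.
# The codecs below are stand-ins for 16- and 32-bit wchar_t arrays.
text = "internationalization"  # arbitrary sample, pure ASCII

sizes = {
    "8-bit (ASCII)": len(text.encode("ascii")),
    "16-bit units": len(text.encode("utf-16-le")),
    "32-bit units": len(text.encode("utf-32-le")),
}

for name, nbytes in sizes.items():
    print(f"{name:>14}: {nbytes} bytes")
```

For pure-ASCII text the wide forms cost exactly two and four times the
8-bit size, which is the trade being accepted above.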

> 
> ----------------------------------------------------------------------
> ----------------------------------------------------------------------
> Are most of your files in ASCII?
> ----------------------------------------------------------------------
> 
> Then you want UTF7/UTF8/ISO2022 encoding, so you don't have to change
> them.  Unless you plan to export your software.  Let the non-English
> speaking world deal with the incompatibilities and storage bloat
> problems.  You'll deal with it in your software when Japan and Europe
> "get their act together" and standardize on IBM-PC derived hardware
> so that your software won't have to be ported to run.
> 
> Besides, C code is in the "C" locale, and that's US-ASCII already.
> GCC supports trigraphs, right?
> 
> ----------------------------------------------------------------------
> Are most of your files in ISO8859-X and/or KOI-8X?
> ----------------------------------------------------------------------
> 
> Then you don't want UTF7/UTF8, because if you get them, some
> characters that currently take up one byte will take up between one
> and three bytes (one if they are US ASCII, more if they are in the
> 0x80-0xff range).
> 
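
The one-to-three byte growth described above can be checked directly.
The sketch below picks a few sample characters (my choice, not from the
original message) and compares their size in a legacy 8-bit encoding
with their size in UTF-8, using Python's standard "latin-1" and "koi8_r"
codecs.

```python
# Sketch: bytes per character in a legacy 8-bit charset versus UTF-8.
# The sample characters are illustrative picks, not from the message.
def byte_costs(ch, legacy_codec):
    """Return (bytes in the legacy charset, bytes in UTF-8) for one character."""
    return len(ch.encode(legacy_codec)), len(ch.encode("utf-8"))

samples = [
    ("US-ASCII letter A", "A", "latin-1"),
    ("ISO8859-1 e-acute", "\u00e9", "latin-1"),
    ("KOI8-R Cyrillic zhe", "\u0436", "koi8_r"),
    ("KOI8-R box-drawing line", "\u2500", "koi8_r"),
]

costs = {name: byte_costs(ch, codec) for name, ch, codec in samples}
for name, (legacy, utf8) in costs.items():
    print(f"{name:<24} legacy={legacy} utf-8={utf8}")
```

Every sample is a single byte in its legacy charset; only the US-ASCII
one stays a single byte in UTF-8.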
> You also don't want ISO2022, because instead of simply choosing a
> locale for all your data, you will have to deal with character set
> shift processing.
> 
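
To make "character set shift processing" concrete, here is a small
sketch using Python's standard "iso2022_jp" codec.  The sample string is
an assumption chosen for illustration: ASCII text, two kanji, then ASCII
again.  Each change of character set is announced in the byte stream by
an escape sequence that a reader has to track.

```python
# Sketch: ISO-2022-JP marks every character-set switch with an escape
# sequence, so a reader must carry shift state through the byte stream.
ESC = b"\x1b"

mixed = "abc \u65e5\u672c xyz"  # ASCII, two kanji (U+65E5 U+672C), ASCII
encoded = mixed.encode("iso2022_jp")

# One escape to shift into JIS X 0208, one to shift back to ASCII.
shifts = encoded.count(ESC)

print(encoded)
print("escape sequences:", shifts)
print("round trip ok:", encoded.decode("iso2022_jp") == mixed)
```

Between the two escapes the kanji are carried as pairs of 7-bit bytes
that would read as ordinary ASCII if the shift state were ignored, which
is why a single per-file locale setting is not enough for this encoding.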
> You could put up with UTF2, because you could do kernel magic to
> expand existing text files on existing filesystems by setting a
> per FS attribute that tells how to get the data in and out of
> Unicode representation.  You still need a "magic doohickey" that
> tells the filesystem to do this for text files, but not for other
> files.
> 
> ----------------------------------------------------------------------
> Are most of your files in ISO2022-jp (JIS-208/JIS-212)?
> ----------------------------------------------------------------------
> 
> Then you don't want UTF7/UTF8/UTF2 encoding, because you don't
> want to have to convert your data.  You don't want Unicode because
> it means you'll have to deal with the sorting problem all over
> again because Unicode's collation sequence isn't the JIS-208/JIS-212
> collation sequence.
> 
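
The collation mismatch is easy to demonstrate with two kanji.  In this
sketch, sorting by the "euc_jp" byte values stands in for JIS X 0208
row/cell order (an assumption that holds for two-byte JIS X 0208
characters), and plain sorted() gives Unicode code point order.  The two
characters are my own picks for illustration.

```python
# Sketch: the same two kanji sort differently by Unicode code point
# and by JIS X 0208 position (approximated here by EUC-JP byte order).
kanji = ["\u4e9c", "\u4e00"]  # U+4E9C and U+4E00 (the numeral "one")

by_unicode = sorted(kanji)  # code point order
by_jis = sorted(kanji, key=lambda ch: ch.encode("euc_jp"))  # JIS order

for label, order in (("Unicode", by_unicode), ("JIS", by_jis)):
    print(label, [f"U+{ord(ch):04X}" for ch in order])
```

U+4E00 sorts first by code point, while U+4E9C sorts first in JIS order,
so any code that relied on raw byte comparison for sorting would need a
collation table after conversion.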

	I understand your point, Terry.  Over the coming days and
	weeks, I'll experiment with 16- and 32-bit wide chars,
	and see how Ito-san's nvi port works.  If his iso-2022
	messages are catalogs, that's most of the battle.



> ----------------------------------------------------------------------
> ----------------------------------------------------------------------
> 
> What this boils down to is language bigotry, and whose language
> you prefer.  Generally, the preference is either driven by personal
> or economic interests (like competitive advantage to your own locale
> from having your locale's preferred method chosen).
> 
> The short sighted approach is to make the decision based on your own
> personal bigotry.
> 
> The longer sighted approach is to make the decision which has the
> best workarounds for backward compatibility and in-place conversion,
> and the least impact in the future based on the assumption that the
> software market is going to normalize all over the world at some point
> in the future, and you just may be around still and have to deal with
> it.  Like the Y2K problem.


	By the time the market normalizes we're likely to be dust.
	Eventually, though, sure.

> 
> 
> 
> If the aliens land, and we end up needing more than 2^16 characters
> in our wchar_t space, well, we can deal with that problem when it
> happens.
> 
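
What "more than 2^16 characters" means for storage can be sketched as
follows.  The code point 0x10000 is a hypothetical stand-in for the
first character past the 16-bit range, and Python's "utf-16-le" and
"utf-32-le" codecs again stand in for 16- and 32-bit wchar_t units.

```python
# Sketch: a character past the 16-bit range no longer fits in a single
# 16-bit unit, but still fits in a single 32-bit unit.
cp = 0x10000          # hypothetical first character beyond 2^16
ch = chr(cp)

fits_16 = cp < 2**16
units16 = len(ch.encode("utf-16-le")) // 2  # 16-bit units needed
units32 = len(ch.encode("utf-32-le")) // 4  # 32-bit units needed

print("fits in 16 bits:", fits_16)
print("16-bit units:", units16)
print("32-bit units:", units32)
```

With 16-bit units the character has to be split into a surrogate pair,
so "one wchar_t per character" stops being true; with 32-bit units it
still holds.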

	I think we need 32-bit wchar_t's already, for the sake
	of completeness.  ... To be continued.

	gary

> 
> 


-- 
   Gary D. Kline         kline@tao.thought.org          Public service uNix

