From owner-freebsd-hackers  Wed Jun 10 17:16:55 1998
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Received: (from majordom@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id RAA28768
          for freebsd-hackers-outgoing; Wed, 10 Jun 1998 17:16:55 -0700 (PDT)
          (envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from gershwin.tera.com (gershwin.tera.com [207.224.230.28])
          by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id RAA28716
          for <hackers@freebsd.org>; Wed, 10 Jun 1998 17:16:43 -0700 (PDT)
          (envelope-from kline@tao.thought.org)
Received: from tao.thought.org (tao.tera.com [207.108.223.55])
	by gershwin.tera.com (8.8.8/8.8.8) with ESMTP id RAA00745;
	Wed, 10 Jun 1998 17:15:55 -0700 (PDT)
Received: (from kline@localhost) by tao.thought.org (8.8.8/8.7.3) id RAA09151; Wed, 10 Jun 1998 17:15:33 -0700 (PDT)
From: Gary Kline <kline@tao.thought.org>
Message-Id: <199806110015.RAA09151@tao.thought.org>
Subject: Re: internationalization
In-Reply-To: <199806102155.OAA13862@usr01.primenet.com> from Terry Lambert at "Jun 10, 98 09:55:44 pm"
To: tlambert@primenet.com (Terry Lambert)
Date: Wed, 10 Jun 1998 17:15:33 -0700 (PDT)
Cc: hackers@FreeBSD.ORG
Organization: <> thought.org: public access uNix in service... <>
X-Mailer: ELM [version 2.4ME+ PL32 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

According to Terry Lambert:

		[[ ... ]]
> >   I'm going ahead with my current implementation and look forward to
> >   hearing from any other hackers who are interested in this.

		---I've been doing further digging since your 
		mail.  This black hole keeps getting more 
		interesting....

> 
> I'm interested.
> 
> Part of the problem here is that FreeBSD doesn't fully support XPG/4.
> 
> Another part of the problem is that XPG/4 is encoded multibyte, which
> is bad from a number of major perspectives, starting with ISO2022.


		We've got v2.0 of the xpg4 library in FreeBSD 2.2.6.
		Do you know if any other flavor of BSD has more
		complete support?


> 
> I would prefer going to a full-on Unicode implementation to support
> all known human languages.
> 

		This was my first leaning, but I'm increasingly
		going toward the ISO families.


> I would suggest an initial 16 bit wchar_t with an assumption of a
> zero valued code page designator.  If ISO ever gets around to adding
> other code pages, we can deal with that at that time using page
> selection.  Meanwhile, we'll be able to interoperate with Microsoft
> and Java, which use 16 bit wchar_t encodings.
> 
> 
> I think the first (and hardest) step is the shells.  The shells need
> to be internationalized based on the fact that they (can) interpret
> exit codes to the user as error messages.


		Exit codes, certainly; but where you've got syserror()
		output, that's another issue.  I agree that the shells 
		are the base: csh|tcsh, and the sh|ksh group.  

> 
> The last time I converted csh, this was absolute hell because the
> code was badly organized for internationalization.
> 
> The next hardest step is the editors, starting with "vi".  They have
> to be able to support Unicode.

		
		nvi/nex have already been tweaked for 8-bit international
		support.  I learned this accidentally.  Was quite
		surprised to see messages in French and German.  :-)

		Nonetheless, I see why you like the Unicode solution.
		Someone said, ``Well, French support is great, but how
		are you going to handle Japanese?''


> 
> I have had FS-based Unicode support working for a very long time,
> though it has failed to be committed.  One big issue is that directory
> entry blocks must grow from 512b to 1k.  This has a number of
> implications for the soft updates work currently in progress.  This is
> because, in order to support a maximally sized path component, 512 + 24
> bytes is needed for Unicode, as opposed to 256 + 24 (which fits in 512b)
> for an 8 bit character set.
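
		To make the arithmetic above concrete, here is a
		minimal sketch in C.  It assumes only the figures
		quoted: a path component of at most 256 characters
		and 24 bytes of fixed per-entry overhead.  The names
		max_entry_size() and min_dir_block() are hypothetical
		and are not taken from any FreeBSD source.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed figures, taken from the numbers quoted in the mail:
 * at most 256 characters per path component and 24 bytes of
 * fixed overhead per directory entry. */
#define MAX_COMPONENT_CHARS 256
#define ENTRY_OVERHEAD      24

/* Worst-case size in bytes of one directory entry when every
 * character is stored in bytes_per_char bytes. */
size_t max_entry_size(size_t bytes_per_char)
{
    assert(bytes_per_char > 0);
    return MAX_COMPONENT_CHARS * bytes_per_char + ENTRY_OVERHEAD;
}

/* Smallest power-of-two block, starting from 512 bytes, that
 * can hold an entry of entry_size bytes. */
size_t min_dir_block(size_t entry_size)
{
    size_t block = 512;

    while (block < entry_size)
        block *= 2;
    return block;
}
```

		With one byte per character the worst case is 280
		bytes, which fits in a 512-byte block.  With two
		bytes per character it is 536 bytes, which forces
		the block up to 1k.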


		:-( !

		How does the ISO2022 model work here?  Isn't it the
		same for Japanese and Chinese?  
> 
> If we were to do something stupid, like UTF-7 or UTF-8, it would have
> to grow to 5 * 256 + 24, minimally, to support the 5:1 character
> expansion that is possible, as opposed to the 2:1 of flat Unicode encoding.

		
		You've lost me here.  What does the translation format
		do, or rather, how?
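
		As a rough illustration of what a transformation
		format does, the sketch below encodes a single
		16-bit value as UTF-8.  The function name
		utf8_encode16() is hypothetical.  It treats each
		16-bit value on its own (no surrogate handling),
		so it emits one to three bytes per character and
		makes no attempt to reproduce the 5:1 worst case
		cited above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one 16-bit code point as UTF-8 into out[] and return
 * the number of bytes written (1, 2 or 3).  Each value is
 * encoded independently; surrogate pairs are not combined. */
size_t utf8_encode16(uint16_t cp, unsigned char out[3])
{
    assert(out != NULL);

    if (cp < 0x80) {                 /* 7 bits: plain ASCII byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                /* 11 bits: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    /* 16 bits: 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

		ASCII stays one byte per character, accented Latin
		letters take two, and most CJK characters take
		three, so the stored length of a name depends on
		its contents rather than being a flat 2:1.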

> 
> For character set attributed FS's (like NFS v2/v3 will have to be), you
> can do the translation in the kernel on the blocks on their way out
> (a 2:1 expansion in memory of a 1:1 disk image for a given ISO character
> set attribution for the filesystem).
> 
> 
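
		The 2:1 in-memory expansion described above can be
		sketched for the simplest case, a filesystem
		attributed as ISO 8859-1, whose 256 values coincide
		with the first 256 Unicode code points.  This is a
		user-space illustration, not kernel code, and
		latin1_to_ucs2() is a hypothetical name.  Other ISO
		8859 parts would need a 256-entry lookup table in
		place of the direct assignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Widen n ISO 8859-1 bytes (the 1:1 on-disk image) into n
 * 16-bit Unicode values (the 2:1 in-memory form).  ISO 8859-1
 * maps directly onto U+0000..U+00FF, so no table is needed. */
void latin1_to_ucs2(const unsigned char *in, size_t n, uint16_t *out)
{
    size_t i;

    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++)
        out[i] = in[i];
}
```

		The disk image stays at one byte per character and
		only the in-memory copy doubles in size.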


		Thanks for your feedback.  It's probably a good idea
		to consider the broader design issues now rather than
		paint myself into a corner.


		gary

