From owner-freebsd-hackers Thu Jun 11 20:09:36 1998
Return-Path:
Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id UAA14528 for freebsd-hackers-outgoing; Thu, 11 Jun 1998 20:09:36 -0700 (PDT) (envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from smtp03.primenet.com (daemon@smtp03.primenet.com [206.165.6.133]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id UAA14478 for ; Thu, 11 Jun 1998 20:09:11 -0700 (PDT) (envelope-from tlambert@usr09.primenet.com)
Received: (from daemon@localhost) by smtp03.primenet.com (8.8.8/8.8.8) id UAA11581; Thu, 11 Jun 1998 20:09:07 -0700 (MST)
Received: from usr09.primenet.com(206.165.6.209) via SMTP by smtp03.primenet.com, id smtpd011560; Thu Jun 11 20:09:06 1998
Received: (from tlambert@localhost) by usr09.primenet.com (8.8.5/8.8.5) id UAA11238; Thu, 11 Jun 1998 20:09:02 -0700 (MST)
From: Terry Lambert <tlambert@primenet.com>
Message-Id: <199806120309.UAA11238@usr09.primenet.com>
Subject: Re: internationalization
To: itojun@itojun.org (Jun-ichiro itojun Itoh)
Date: Fri, 12 Jun 1998 03:09:02 +0000 (GMT)
Cc: hackers@FreeBSD.ORG
In-Reply-To: <646.897615430@coconut.itojun.org> from "Jun-ichiro itojun Itoh" at Jun 12, 98 10:37:10 am
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> > I believe it is an error to use multibyte encodings, as they destroy
> > important information which is utilized by 8-bit alphabetic programmers
> > to make user interaction and data storage tasks drastically less
> > complicated than they would otherwise be.
>
> We got too much computation power already (most of the computers
> are just waiting for human input, as shown by the success of
> distributed DES challenge).

And we should be spending that processing time on useful work, not on
rune processing.
Kirk McKusick is often quoted as saying "The number of MIPS delivered to
the keyboard has remained constant since 1976". Just because the MIPS are
there doesn't mean they should be burned on processing required by an
intentionally mismatched process and storage encoding.

Also note that translations of this type are cache busters. This means
that your processing rate will be limited by your memory bus fetch rate
(33MHz), not your clock-multiplied CPU speed (400MHz). This is
incidentally why I think clock multipliers are the worst idea since DOS.

> It is just posible to use multibyte
> encoding, maybe with some help from library (runelocale for multibyte
> encoding). The problems are:
> - there are too many programs that assume character bitwidth
>   is proportional to the width on the screen
>   (this is also harmful for propotional font handling. GUI people
>   already takes care of this)

Proportional fonts are obnoxious. However, they can be lived with, using
"gravity" notation to hang them based on their right end instead of the
left end (for English; insert your directional coding and symmetry
requirements here).

Not assuming that character bitwidth is proportional to storage encoding
is a bad idea. It destroys useful information and a number of simplifying
assumptions which bear on computational complexity. Being in a multibyte
locale, this may not be obvious to you, since using a multibyte locale
destroys the same information.

The fact that Americans are not in a multibyte locale, and can make these
simplifying assumptions, is one of the competitive advantages of the
American and European software industries. I would prefer that the
Japanese enjoy the American and European competitive advantage, rather
than the Americans and Europeans being forced to suffer the Japanese
disadvantage.

> - there are too many programs that assume 8th bit of "char" is
>   availble as flag bit
> - and more

Most newer software is 8-bit clean.
The obvious offenders were SMTP and termcap, both of which have since
been corrected. I think this assumption is not widespread.

> I agree that the man-months are eaten for Kanji processing in Japanese
> software industry, but I certanly not agree that Japanese should
> have been moved to Kana-only world. How do you think if you are
> told to move to 6-letter (yea, A to F) world just to fit letters
> and digits into 4bits?

The point is not a reduction in an alphabetic symbol space, as in your
A-F example. A switch from Kanji to Kana would not damage the ability to
represent any Japanese words; it's a switch from an ideogrammatic to an
alphabetic representation. The origins of Kanji as an ideogrammatic
writing system owe more to the need of Imperial China to control the
persistent information available to Chinese serfs, in support of a feudal
society, than they do to its information density compared to alphabetic
writing systems.

> People should not be forced to fit to information processing tools.
> Information processing tools must evolve to support the natural thing
> people wants to perform. Because people got Kanji letters already,
> we support that.

U.S. Robotics, the inventors of "Graffiti", might disagree with you. 8-).

The "limit English to A through F" argument is really a strawman. The
issue I was addressing was the need for a large educational
infrastructure, the front-loading of language skills, and the inability
to phonetically derive words a child knows verbally but for which the
child has not been taught the symbolic representation.

For alphabetic languages, even badly pseudo-phonetic languages like
English, a child aware of phonics can guess, with a high probability of
success, the symbols needed to represent any given spoken word, and thus
communicate it persistently, despite the ill-considered effort the
American educational system has historically made to turn clusters of
English letters into ideograms.
With an ideogrammatic language, the child cannot even guess at the
correct word. I believe it is common in primary language education in
Japan to use Katakana dictionaries to look up Kanji for children... my
copy of "Peach Boy Momo" has Katakana superscript for new Kanji symbols I
wasn't previously exposed to. 8-).

I would certainly support switching the English speaking world to a
phonetic alphabet, or some other synthetic written language with 100%
regular rules. It would give us one hell of a competitive advantage if
our kids were to learn to read and write, say, two years earlier on
average, while their brains were still highly plastic.

In any case, the Japanese problem is not the same. The Japanese problem
is "how do I put 20,000 symbols on a keyboard smaller than the largest
room in my house?" (that's 200 PC keyboards, for those of us reading
8-bit alphabetic languages). Solving this problem by chording is
inefficient, and introduces a third representational geometry that users
of the language must also learn (tactile). Kanji may have a higher
information density, but the absolute information transfer rate from
humans to computers is much higher for English, and will remain so until
the Kanji handwriting recognition problem is fully solved.

> The summary based on "do you store your files in xxx?" was nice :-)

Thanks; we're all language bigots at heart. 8-).

I support Unicode encoding because it's advantageous to me personally in
the long term. It just happens to be advantageous to everyone else in the
long term as well (which is what makes it advantageous to me personally
in the long term: I will have a solution to sell when they come looking
for one to buy).

I would also like to solve this problem once (and only once), in such a
way that I can make the maximum number of simplifying assumptions while
writing my code (which is the source of my support for the "fixed 16 bit
storage" model).
I'm willing to double my text data memory and disk footprint now, in
order to keep from multiplying it by 3, 4, or 5 later because I was too
lazy to consider my potential markets desirable when I made the
short-sighted decision. Not to mention the conversion overhead, which
goes up the longer documents are created in anything other than the final
format I'd have to decide on eventually anyway. If I have to pay the
costs, I'd rather pay them now.

I'm not willing to pay the penalty to fully multinationalize all of my
applications, because I fully expect to use COM/DCOM/ActiveX/CORBA
objects and object request brokers to access code that someone else has
expended the effort to multinationalize, if I ever decide I need to
support multinational text.

Engineers are basically lazy. They will write a 200 line program just to
avoid typing the same set of 5 commands another three times, in the
expectation that they will (maybe) need to type them again. That's where
tools like "make" and "lex" and "yacc" come from. It doesn't always make
sense, but it tends to benefit the rest of the world in the long run...

					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message