From owner-freebsd-hackers Thu Jun 11 20:09:36 1998
Return-Path:
Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id UAA14528 for freebsd-hackers-outgoing; Thu, 11 Jun 1998 20:09:36 -0700 (PDT) (envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from smtp03.primenet.com (daemon@smtp03.primenet.com [206.165.6.133]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id UAA14478 for ; Thu, 11 Jun 1998 20:09:11 -0700 (PDT) (envelope-from tlambert@usr09.primenet.com)
Received: (from daemon@localhost) by smtp03.primenet.com (8.8.8/8.8.8) id UAA11581; Thu, 11 Jun 1998 20:09:07 -0700 (MST)
Received: from usr09.primenet.com(206.165.6.209) via SMTP by smtp03.primenet.com, id smtpd011560; Thu Jun 11 20:09:06 1998
Received: (from tlambert@localhost) by usr09.primenet.com (8.8.5/8.8.5) id UAA11238; Thu, 11 Jun 1998 20:09:02 -0700 (MST)
From: Terry Lambert <tlambert@primenet.com>
Message-Id: <199806120309.UAA11238@usr09.primenet.com>
Subject: Re: internationalization
To: itojun@itojun.org (Jun-ichiro itojun Itoh)
Date: Fri, 12 Jun 1998 03:09:02 +0000 (GMT)
Cc: hackers@FreeBSD.ORG
In-Reply-To: <646.897615430@coconut.itojun.org> from "Jun-ichiro itojun Itoh" at Jun 12, 98 10:37:10 am
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> > I believe it is an error to use multibyte encodings, as they destroy
> > important information which is utilized by 8-bit alphabetic programmers
> > to make user interaction and data storage tasks drastically less
> > complicated than they would otherwise be.
>
> We got too much computation power already (most of the computers
> are just waiting for human input, as shown by the success of
> distributed DES challenge).

And we should be spending that processing time on useful work, not on
rune processing.
Kirk McKusick is often quoted as saying "The number of MIPS delivered to
the keyboard has remained constant since 1976". Just because the MIPS are
there doesn't mean they should be burned on processing required by an
intentionally mismatched process and storage encoding.

Also note that translations of this type are cache busters. This means
that your processing rate will be limited by your memory bus fetch rate
(33MHz), not your clock-multiplied CPU speed (400MHz). This is
incidentally why I think clock multipliers are the worst idea since DOS.

> It is just posible to use multibyte
> encoding, maybe with some help from library (runelocale for multibyte
> encoding). The problems are:
> - there are too many programs that assume character bitwidth
>   is proportional to the width on the screen
>   (this is also harmful for propotional font handling. GUI people
>   already takes care of this)

Proportional fonts are obnoxious. However, they can be lived with, using
"gravity" notation to hang them based on their right end instead of the
left end (for English; insert your directional coding and symmetry
requirements here).

Not assuming that character bitwidth is proportional to storage encoding
is a bad idea. It destroys useful information and a number of simplifying
assumptions which bear on computational complexity. Being in a multibyte
locale, this may not be obvious to you, since using a multibyte locale
destroys the same information.

The fact that Americans are not in a multibyte locale, and can make these
simplifying assumptions, is one of the competitive advantages of the
American and European software industries. I would prefer that the
Japanese enjoy the American and European competitive advantage, rather
than the Americans and Europeans being forced to suffer the Japanese
disadvantage.

> - there are too many programs that assume 8th bit of "char" is
>   availble as flag bit
> - and more

Most newer software is 8-bit clean.
The obvious offenders were SMTP and termcap, both of which have since
been corrected. I think this assumption is not widespread.

> I agree that the man-months are eaten for Kanji processing in Japanese
> software industry, but I certanly not agree that Japanese should
> have been moved to Kana-only world. How do you think if you are
> told to move to 6-letter (yea, A to F) world just to fit letters
> and digits into 4bits?

The point is not a reduction in an alphabetic symbol space, as in your
A-F example. A switch from Kanji to Kana would not damage the ability to
represent any Japanese words; it's a switch from an ideogrammatic to an
alphabetic representation. The origins of Kanji as an ideogrammatic
writing system owe more to the need of Imperial China to control the
persistent information available to Chinese serfs, in support of a feudal
society, than they do to its information density compared to alphabetic
writing systems.

> People should not be forced to fit to information processing tools.
> Information processing tools must evolve to support the natural thing
> people wants to perform. Because people got Kanji letters already,
> we support that.

U.S. Robotics, the inventors of "Graffiti", might disagree with you. 8-).

The "limit English to A through F" argument is really a strawman. The
issue I was addressing was the need for a large educational
infrastructure, the front-loading of language skills, and the inability
to phonetically derive words a child knows verbally but for which the
child has not been taught the symbolic representation.

For alphabetic languages, even badly pseudo-phonetic languages like
English, a child aware of phonics can guess, with a high probability of
success, the symbols needed to represent any given spoken word, and thus
communicate it persistently, despite the ill-considered effort the
American educational system has historically made to turn clusters of
English letters into ideograms.
With an ideogrammatic language, the child cannot even guess at the
correct word. I believe it is common in primary language education in
Japan to use Katakana dictionaries to look up Kanji for children... my
copy of "Peach Boy Momo" has Katakana superscript for new Kanji symbols I
wasn't previously exposed to. 8-).

I would certainly support switching the English speaking world to a
phonetic alphabet, or some other synthetic written language with 100%
regular rules. It would give us one hell of a competitive advantage if
our kids were to learn to read and write, say, two years earlier on
average, while their brains were still highly plastic.

In any case, the Japanese problem is not the same. The Japanese problem
is "how do I put 20,000 symbols on a keyboard smaller than the largest
room in my house?" (that's 200 PC keyboards, for those of us reading
8-bit alphabetic languages). Solving this problem by chording is
inefficient, and introduces a third representational geometry that users
of the language must also learn (tactile). Kanji may have a higher
information density, but the absolute information transfer rate from
humans to computers is much higher for English, and will remain so until
the Kanji handwriting recognition problem is fully solved.

> The summary based on "do you store your files in xxx?" was nice :-)

Thanks; we're all language bigots at heart. 8-).

I support Unicode encoding because it's advantageous to me personally in
the long term. It just happens to be advantageous to everyone else in the
long term as well (which is what makes it advantageous to me personally
in the long term: I will have a solution to sell when they come looking
for one to buy).

I would also like to solve this problem once (and only once), in such a
way that I can make the maximum number of simplifying assumptions while
writing my code (which is the source of my support for the "fixed 16 bit
storage" model).
I'm willing to double my text data memory and disk footprint now, in
order to keep from multiplying it by 3, 4, or 5 later because I was too
lazy to consider my potential markets desirable when I made the
short-sighted decision. Not to mention the conversion overhead, which
goes up the longer documents are created in anything other than the final
format I'd have to decide on eventually anyway. If I have to pay the
costs, I'd rather pay them now.

I'm not willing to pay the penalty to fully multinationalize all of my
applications, because I fully expect to use COM/DCOM/ActiveX/CORBA
objects and object request brokers to access code that someone else has
expended the effort to multinationalize, if I ever decide I need to
support multinational text.

Engineers are basically lazy. They will write a 200 line program just to
avoid typing the same set of 5 commands another three times, in the
expectation that they will (maybe) need to type them again. That's where
tools like "make" and "lex" and "yacc" come from. It doesn't always make
sense, but it tends to benefit the rest of the world in the long run...

					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message