From owner-freebsd-hackers  Thu May  4 04:46:31 1995
Return-Path: hackers-owner
Received: (from majordom@localhost)
          by freefall.cdrom.com (8.6.10/8.6.6) id EAA07459
          for hackers-outgoing; Thu, 4 May 1995 04:46:31 -0700
Received: from silvia.HIP.Berkeley.EDU (silvia.HIP.Berkeley.EDU [136.152.64.181])
          by freefall.cdrom.com (8.6.10/8.6.6) with ESMTP id EAA07453
          ; Thu, 4 May 1995 04:46:26 -0700
Received: (from asami@localhost) by silvia.HIP.Berkeley.EDU (8.6.11/8.6.9) id EAA01328; Thu, 4 May 1995 04:46:20 -0700
Date: Thu, 4 May 1995 04:46:20 -0700
Message-Id: <199505041146.EAA01328@silvia.HIP.Berkeley.EDU>
To: jkh@time.cdrom.com
CC: ache@FreeBSD.org, hackers@FreeBSD.org
In-reply-to: <16984.799556850@time.cdrom.com> (jkh@time.cdrom.com)
Subject: Re: Can someone explain the various forms of Japanese text encoding?
From: asami@cs.berkeley.edu (Satoshi Asami | =?ISO-2022-JP?B?GyRCQHUbKEI=?= =?ISO-2022-JP?B?GyRCOCsbKEIgGyRCOC0bKEI=?=)
Sender: hackers-owner@FreeBSD.org
Precedence: bulk

 * So far I've seen "romanji", which appears to be a romanized form of
                     ^^^^^^^
          this should be "romaji"

 * Japanese, JIS (which is?) and "EUC" (which is?).

JIS (short for "Japanese Industrial Standard", the Japanese
counterpart of ISO) code is the "real" standard in the sense that (1)
it can coexist with other multi-byte languages, and (2) there is a
"standard" for it (JIS X 0208).  It uses Esc-$-B to start the Japanese
part and Esc-(-B to end it (i.e., back to ASCII).  Within the Japanese
part, two bytes denote a single character.
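
To make the escape sequences concrete, here is a small Python sketch.
It assumes that Python's "iso2022_jp" codec produces the JIS encoding
described above; the sample string and variable names are only for
illustration.

```python
# Sketch: look at the bytes of a mixed ASCII/Japanese string in JIS.
# Assumption: Python's "iso2022_jp" codec matches the JIS encoding
# described in this message.
ESC = b"\x1b"
KANJI_IN = ESC + b"$B"   # Esc-$-B: start of the Japanese part
ASCII_IN = ESC + b"(B"   # Esc-(-B: back to ASCII

# "FreeBSD " followed by three kanji (written as escapes to stay 7-bit).
text = "FreeBSD \u65e5\u672c\u8a9e"
jis = text.encode("iso2022_jp")

# Cut out the bytes between the two escape sequences.
start = jis.index(KANJI_IN) + len(KANJI_IN)
end = jis.index(ASCII_IN)
kanji_bytes = jis[start:end]

print(jis)               # ASCII part, Esc-$-B, kanji bytes, Esc-(-B
print(len(kanji_bytes))  # two bytes for each of the three kanji
```

The kanji bytes themselves all fall in the printable ASCII range, which
is why the result survives 7-bit mail transport.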

Other than the Esc in the escape sequences, it uses only the printable
ASCII chars (0x21 - 0x7e for the Japanese part).  Note that even in a
purely Japanese document, the end of line is still represented by an
ASCII 0x0a, so every line that contains Japanese has at least one
Esc-$-B and one Esc-(-B.
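
The per-line behavior can be checked with a few lines of Python.  This
is only a sketch and assumes that Python's "iso2022_jp" codec switches
back to ASCII before each newline, as described above; the sample text
is made up.

```python
# Sketch: count the escape sequences on each line of a JIS document.
# Assumption: Python's "iso2022_jp" codec behaves like the JIS encoding
# described in this message.
KANJI_IN = b"\x1b$B"   # Esc-$-B
ASCII_IN = b"\x1b(B"   # Esc-(-B

# Two lines of kanji and one line of plain ASCII.
text = "\u65e5\u672c\u8a9e\n\u65e5\u672c\u8a9e\nplain ASCII\n"
jis = text.encode("iso2022_jp")

# Drop the empty piece after the final newline.
lines = jis.split(b"\n")[:-1]
for line in lines:
    print(line.count(KANJI_IN), line.count(ASCII_IN), line)

# The whole document stays within 7 bits.
seven_bit = all(b < 0x80 for b in jis)
print(seven_bit)
```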

EUC ("Extended Unix Code") assumes the entire world is composed of the
USA and Japan (typical Japanese thinking :).  Basically, it takes the
JIS-encoded sequence, sets the 8th bit (in both bytes) of the Japanese
part, and strips out the escape sequences.  Thus, if Joerg spells his
name right using an 8-bit ISO 8859-1 character, EUC software will get
confused.
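
The JIS-to-EUC step described above can be sketched in Python as
follows.  The function name jis_to_euc is hypothetical, and the sketch
assumes the input uses only the two escape sequences mentioned in this
message (Esc-$-B and Esc-(-B).  The result is compared against Python's
own "euc_jp" codec as a sanity check.

```python
# Sketch: convert JIS to EUC by dropping the escape sequences and
# setting the 8th bit on every byte of the Japanese part.
# Assumption: only Esc-$-B and Esc-(-B appear in the input.
ESC = 0x1B


def jis_to_euc(data: bytes) -> bytes:
    out = bytearray()
    in_kanji = False
    i = 0
    while i < len(data):
        if data[i] == ESC and data[i + 1:i + 3] == b"$B":
            in_kanji = True       # Japanese part starts; drop the escape
            i += 3
        elif data[i] == ESC and data[i + 1:i + 3] == b"(B":
            in_kanji = False      # back to ASCII; drop the escape
            i += 3
        elif in_kanji and 0x21 <= data[i] <= 0x7E:
            out.append(data[i] | 0x80)   # set the 8th bit
            i += 1
        else:
            out.append(data[i])          # ASCII passes through unchanged
            i += 1
    return bytes(out)


# ASCII, three kanji, ASCII again.
text = "abc \u65e5\u672c\u8a9e xyz"
jis = text.encode("iso2022_jp")
euc = jis_to_euc(jis)

print(euc == text.encode("euc_jp"))
print(len(jis), len(euc))   # EUC is shorter by the two escape sequences
```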

In both JIS and EUC, the Japanese part is 2 bytes per char, and
usually, the Japanese fonts are twice the width of their ASCII
counterparts.  Thus, on an 80-character-wide screen, you can display 40
Japanese letters (for 80 bytes).  Of course JIS is a little longer due
to the escape sequences.

This makes it a little harder to design a program like "less", 'cause
with JIS it needs to skip the escape sequences when it counts the
number of bytes to decide where to fold the lines.  EUC, on the other
hand, has as many bytes as display columns (for the two-byte characters
described here), so the utility doesn't have to worry about anything as
long as the terminal can handle EUC.
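
Here is a rough Python sketch of the column counting a pager would have
to do for JIS.  The function name jis_columns is hypothetical, and the
sketch assumes every escape sequence is exactly three bytes long
(Esc-$-B or Esc-(-B) and that each remaining byte takes one display
column.

```python
# Sketch: count display columns of one JIS-encoded line by skipping the
# escape sequences.
# Assumptions: every escape sequence is 3 bytes (Esc-$-B or Esc-(-B),
# and every other byte occupies one display column.
def jis_columns(line: bytes) -> int:
    cols = 0
    i = 0
    while i < len(line):
        if line[i] == 0x1B:
            i += 3        # skip the whole escape sequence
        else:
            cols += 1     # one byte, one column
            i += 1
    return cols


# 30 kanji (60 columns) followed by two ASCII characters.
text = "\u65e5\u672c\u8a9e" * 10 + "ok"
jis = text.encode("iso2022_jp")
euc = text.encode("euc_jp")

print(len(jis), jis_columns(jis))   # byte count differs from columns
print(len(euc))                     # byte count equals columns
```

With EUC, the plain byte count already gives the column count, so no
such function is needed.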

People in Japan usually use JIS for communication (it's 7-bit, and
it's also "standard" in a broader sense) and EUC for storage (it's
shorter).

 * 						     I'd like to support
 * the "most standard" type for sysinstall, but I'm a little unclear as
 * to just exactly what that might be.

If we are planning to support NEC's popular (in Japan) PC-9801 series
of computers (with Japanese support built into the console), we'll
need to go to the third standard, called "Shift-JIS" (meaning "shifted
JIS") or "MS-Kanji".  This is a truly kludgy format.  I don't even want
to try to explain it here, so let's not worry about it for now. ;)

I'd say we should eventually go with JIS.  EUC is more of a domestic
thing and doesn't coexist well with other multi-byte languages.
Converting between them is easy though.

 * 					Romanji looks like the easiest to
 * display, but it's probably also the least palatable to the native
 * Japanese speaker.

True...it's very very hard for Japanese people to read romaji.  But I
guess it's better than English for many.

 * 		      Given that I also have *no* Japanese fonts for
 * syscons, I'm also somewhat limited in that dept. anyway.  There is a
 * format I can display with the ISO8859-1 font, according to Satoshi,
 * though I'm still a little unclear on how it works.

According to ME?!?  When did I say that? ;)  I don't think that's
possible.... :<

Anyway, since we don't have fonts, I think we are pretty much stuck
with romaji for now.  Oh well. :<

Satoshi