From owner-freebsd-hackers Thu May 4 04:46:31 1995
Return-Path: hackers-owner
Received: (from majordom@localhost) by freefall.cdrom.com (8.6.10/8.6.6) id EAA07459 for hackers-outgoing; Thu, 4 May 1995 04:46:31 -0700
Received: from silvia.HIP.Berkeley.EDU (silvia.HIP.Berkeley.EDU [136.152.64.181]) by freefall.cdrom.com (8.6.10/8.6.6) with ESMTP id EAA07453 ; Thu, 4 May 1995 04:46:26 -0700
Received: (from asami@localhost) by silvia.HIP.Berkeley.EDU (8.6.11/8.6.9) id EAA01328; Thu, 4 May 1995 04:46:20 -0700
Date: Thu, 4 May 1995 04:46:20 -0700
Message-Id: <199505041146.EAA01328@silvia.HIP.Berkeley.EDU>
To: jkh@time.cdrom.com
CC: ache@FreeBSD.org, hackers@FreeBSD.org
In-reply-to: <16984.799556850@time.cdrom.com> (jkh@time.cdrom.com)
Subject: Re: Can someone explain the various forms of Japanese text encoding?
From: asami@cs.berkeley.edu (Satoshi Asami | =?ISO-2022-JP?B?GyRCQHUbKEI=?= =?ISO-2022-JP?B?GyRCOCsbKEIgGyRCOC0bKEI=?=)
Sender: hackers-owner@FreeBSD.org
Precedence: bulk

 * So far I've seen "romanji", which appears to be a romanized form of
                     ^^^^^^^ this should be "romaji"
 * Japanese, JIS (which is?) and "EUC" (which is?).

JIS (short for "Japanese Industrial Standards", the Japanese
equivalent of ISO) code is the "real" standard in the sense that (1)
it can coexist with other multi-byte languages, and (2) there is a
"standard" for it (JIS-X-0208). It uses Esc-$-B to start the Japanese
part and Esc-(-B to end it (i.e., to switch back to ASCII). Inside the
Japanese part, two bytes denote a single character. Other than the Esc
in those two delimiters, it uses only printable ASCII chars (a
subrange of 0x20 - 0x7e). Note that even in a purely Japanese
document, the end of line is still represented by 0x0a, so there is at
least one Esc-$-B / Esc-(-B pair per line (unless the line contains no
Japanese at all).

EUC ("Extended Unix Code") assumes the entire world is composed of the
USA and Japan (typical Japanese thinking :).
Basically, it takes the JIS-encoded sequence, sets the 8th bit (in
both bytes) of each character in the Japanese part, and rips off the
escape sequences. Thus, if Joerg spells his name right using the 8th
bit (i.e., with an ISO 8859-1 o-umlaut), EUC software will get
confused.

In both JIS and EUC, the Japanese part is 2 bytes per char, and
usually the Japanese fonts are twice the width of their ASCII
counterparts. Thus, on an 80-character wide screen, you can display 40
Japanese letters (for 80 bytes). Of course JIS is a little longer due
to the escape sequences. This makes it a little harder to design a
program like "less", 'cause it needs to ignore the escape sequences
when it counts the number of bytes to decide where to fold the lines
in JIS. In EUC, on the other hand, the number of bytes equals the
number of screen columns, so the utility doesn't have to worry about
anything as long as the terminal can handle EUC.

People in Japan usually use JIS for communication (it's 7-bit, and
it's also "standard" in a broader sense) and EUC for storage (it's
shorter).

 * I'd like to support
 * the "most standard" type for sysinstall, but I'm a little unclear as
 * to just exactly what that might be.

If we are planning to support NEC's popular (in Japan) PC-9801 series
of computers (with Japanese support built into the console), we'll
need to go to the third standard, called "Shift-JIS" (meaning "shifted
JIS") or "MS-Kanji". This is a truly kludgy format; I don't even want
to try to explain it here, and let's not worry about it for now. ;)

I'd say we should eventually go with JIS; EUC is more of a domestic
thing and doesn't coexist well with other multi-byte languages.
Converting between them is easy though.

 * Romanji looks like the easiest to
 * display, but it's probably also the least palatable to the native
 * Japanese speaker.

True...it's very very hard for Japanese people to read romaji. But I
guess it's better than English for many.

 * Given that I also have *no* Japanese fonts for
 * syscons, I'm also somewhat limited in that dept. anyway.
 * There is a
 * format I can display with the ISO8859-1 font, according to Satoshi,
 * though I'm still a little unclear on how it works.

According to ME?!? When did I say that? ;) I don't think that's
possible.... :<

Anyway, since we don't have fonts, I think we are pretty much stuck
with romaji for now. Oh well. :<

Satoshi
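
For illustration, here is a minimal Python sketch of the JIS/EUC
conversion described above. It is only a sketch under stated
assumptions: it handles just the two escape sequences mentioned
(Esc-$-B and Esc-(-B) and two-byte JIS-X-0208 characters, and ignores
everything else real-world JIS and EUC text may contain (half-width
kana, other character sets). The names jis_to_euc, euc_to_jis and
display_columns are hypothetical, chosen for this example only.

```python
# Assumption: only these two escape sequences ever appear in the input.
ESC_TO_KANJI = b"\x1b$B"  # Esc-$-B: start of the Japanese part
ESC_TO_ASCII = b"\x1b(B"  # Esc-(-B: back to ASCII


def jis_to_euc(data: bytes) -> bytes:
    """Drop the escape sequences and set the 8th bit on both bytes
    of every character inside the Japanese part."""
    out = bytearray()
    in_kanji = False
    i = 0
    while i < len(data):
        if data.startswith(ESC_TO_KANJI, i):
            in_kanji = True
            i += 3
        elif data.startswith(ESC_TO_ASCII, i):
            in_kanji = False
            i += 3
        elif in_kanji and 0x21 <= data[i] <= 0x7E and i + 1 < len(data):
            out.append(data[i] | 0x80)
            out.append(data[i + 1] | 0x80)
            i += 2
        else:
            # ASCII (or a newline inside the Japanese part): copy as-is.
            out.append(data[i])
            i += 1
    return bytes(out)


def euc_to_jis(data: bytes) -> bytes:
    """Clear the 8th bit on two-byte characters and wrap each run of
    them in Esc-$-B ... Esc-(-B."""
    out = bytearray()
    in_kanji = False
    i = 0
    while i < len(data):
        if data[i] >= 0xA1 and i + 1 < len(data):
            if not in_kanji:
                out += ESC_TO_KANJI
                in_kanji = True
            out.append(data[i] & 0x7F)
            out.append(data[i + 1] & 0x7F)
            i += 2
        else:
            if in_kanji:
                out += ESC_TO_ASCII
                in_kanji = False
            out.append(data[i])
            i += 1
    if in_kanji:
        out += ESC_TO_ASCII
    return bytes(out)


def display_columns(jis_line: bytes) -> int:
    """Screen columns a JIS line occupies, assuming Japanese glyphs are
    twice as wide as ASCII ones: the EUC byte count."""
    return len(jis_to_euc(jis_line))


if __name__ == "__main__":
    jis = b"ab\x1b$BF|K\\\x1b(Bcd\n"
    euc = jis_to_euc(jis)
    print(euc)                   # the same text without escape sequences
    print(euc_to_jis(euc) == jis)
```

display_columns is the kind of count a pager would need before folding
a JIS line: the escape sequences take up bytes but no screen columns.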