Date:      Wed, 12 Dec 2001 03:30:37 -0800
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Maxim Sobolev <sobomax@FreeBSD.org>
Cc:        Liu Siwei <swliu77@hotmail.com>, current@FreeBSD.org
Subject:   Re: Hi,All
Message-ID:  <3C173FDD.5CB96DE4@mindspring.com>
References:  <F43Fjxe0ymQ1WE9ql2R0000061b@hotmail.com> <3C170F56.435FA023@FreeBSD.org>

Maxim Sobolev wrote:
> Liu Siwei wrote:
> >    I love FreeBSD! But.. Can it support CD-RW disc and Simplie Chinese
> > Filename? A lot of files in CD-ROM that have Chinese name, how can i open it
> > under FreeBSD? Oh...Oh....
> 
> What is the official name for Simplie Chinese codepage? If it is a
> 1-byte charset, then I could probably add support for it into
> cd9660_unicode ports, which would allow accessing files with such
> filenames on them.

The most common character sets for Chinese are:

	GB-2312		Simplified Chinese
	EUC-TW		Traditional Chinese
	Big-5		Traditional Chinese

The one in most common use is Big-5.

Unicode supports Chinese through its CJK unification, and its
characters can be round-tripped to and from any of the above
character set standards.

All of these are multibyte character sets.  Unfortunately, UTF-7
and UTF-8 tend to be used with Unicode, which destroys fixed field
storage of data (since any character can take up to 5 bytes to
store, depending on its code point, when UTF encoding is used).
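To make the fixed-field point concrete, here is a minimal sketch that
counts how many bytes individual characters occupy under a fixed-width
2-byte encoding and under the two UTF encodings.  It relies only on
Python's standard codecs; the function name and the sample characters
are assumptions made for illustration.

```python
# Illustration only: compare a fixed-width 2-byte encoding with the
# variable-width UTF encodings, one character at a time.
ENCODINGS = ("utf-16-be", "utf-8", "utf-7")


def bytes_per_char(text):
    """Map each character to its encoded length under every encoding."""
    return {ch: {enc: len(ch.encode(enc)) for enc in ENCODINGS} for ch in text}


if __name__ == "__main__":
    # An ASCII letter, a Latin-1 letter, and a Chinese character.
    for ch, sizes in bytes_per_char("A\u00e9\u4e2d").items():
        print(repr(ch), sizes)
```

The 2-byte column stays constant for these characters while the UTF
columns vary, which is why a field sized in characters can no longer be
sized in bytes.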

The answer to the original question is "it depends on how the
Chinese character data is stored on the CDROM".

If the storage is as multibyte, then decoding it is the job of the
rendering engine.  In other words, you leave it alone, and use a
Chinese display program and input method for X Windows, and it will
"just work".

If the storage is as Unicode code points, then, since tty interfaces
are currently single byte, you would need a converter between the FS
and the directory code to convert the names to multibyte, so that
when you list the directory, you get the multibyte values out, and
those, in turn, are rendered by the Chinese-capable multibyte program
(xterm/etc.).

Right now, FreeBSD does not convert to/from Unicode 2/4 byte encoding
(Windows uses 2 byte encoding, as does Joliet, the Windows CDROM
standard, which *is* supported by FreeBSD); it merely masks off the
high byte of the two bytes, taking advantage of the fact that the
first 256 code points of Unicode are identical to ISO 8859-1
(Latin-1).  You would need to be able to load round trip tables into
the kernel (probably via an ioctl()), which is what Windows does.
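The effect of masking off the high byte can be shown with a small
sketch.  It is an illustration of the behaviour described above, not
the actual cd9660 code, and the helper name is made up.

```python
def mask_high_byte(ucs2_be_name: bytes) -> bytes:
    """Keep only the low byte of each big-endian 16-bit unit."""
    return ucs2_be_name[1::2]


latin = "caf\u00e9.txt".encode("utf-16-be")
chinese = "\u4e2d\u6587.txt".encode("utf-16-be")

# Latin-1 names survive because their high bytes are all zero.
print(mask_high_byte(latin).decode("latin-1"))
# Chinese names lose their high bytes, so the two characters collapse
# into unrelated single bytes.
print(mask_high_byte(chinese))
```

The first name comes through intact; the second keeps its ".txt" suffix
but the Chinese part is unrecoverable, which is why a real conversion
table is needed.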

Note that because of the expansion requirements, it's possible to
have 256 Unicode stored Chinese characters bloat to 1280 bytes
(256 x 5), which exceeds both the maximum file name component length
(256) and path name length (1024) set by UNIX (and copied by FreeBSD).
It's highly unlikely that anyone has encoded this type of data, but
the possibility is there.
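The arithmetic behind that worst case, as a small sketch.  The limits
and the 5 bytes per character are the figures quoted in this message;
the constant and function names are invented for the example.

```python
NAME_LIMIT = 256      # file name component limit, as quoted above
PATH_LIMIT = 1024     # path name limit, as quoted above
BYTES_PER_CHAR = 5    # worst-case expansion per character, as quoted above


def worst_case_bytes(n_chars, bytes_per_char=BYTES_PER_CHAR):
    """Largest encoded size of a name that is n_chars characters long."""
    return n_chars * bytes_per_char


def max_safe_chars(limit, bytes_per_char=BYTES_PER_CHAR):
    """Longest name, in characters, guaranteed to fit within limit bytes."""
    return limit // bytes_per_char


if __name__ == "__main__":
    print(worst_case_bytes(256), max_safe_chars(NAME_LIMIT))
```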

Ignoring all that, you should be able to do a lookup from the
Unicode table to, say, Big-5, and back, with one 64K table in
each direction, and EUC or otherwise multibyte encode the result
before returning it via getdirentries().  This will break under
Linux emulation (I believe), since it uses the directory lookup
restart code, which will be variant under multibyte translation.
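A sketch of the two 64K-entry tables, built here in userland from
Python's own Big-5 codec purely for illustration.  A kernel
implementation would load precomputed tables instead; the function
names, the None marker for unmapped slots and the '?' fallback are
assumptions.

```python
def build_tables(charset="big5"):
    """Return (unicode -> multibyte, multibyte -> unicode) lookup tables.

    Each table has 65536 slots; None marks an unmapped slot.  Multibyte
    values are stored as integers (one byte, or lead << 8 | trail).
    """
    to_mb = [None] * 0x10000
    from_mb = [None] * 0x10000
    for cp in range(0x10000):
        try:
            raw = chr(cp).encode(charset)
        except UnicodeEncodeError:
            continue                  # no mapping for this code point
        if len(raw) > 2:
            continue                  # would not fit a 16-bit table slot
        code = int.from_bytes(raw, "big")
        to_mb[cp] = code
        if from_mb[code] is None:     # first mapping wins on the way back
            from_mb[code] = cp
    return to_mb, from_mb


def ucs2_to_multibyte(name_be, to_mb):
    """Translate a big-endian 2-byte Unicode name via the lookup table."""
    out = bytearray()
    for i in range(0, len(name_be) - 1, 2):
        code = to_mb[int.from_bytes(name_be[i:i + 2], "big")]
        if code is None:
            out += b"?"               # unmappable character
        elif code < 0x100:
            out.append(code)          # single-byte (ASCII) result
        else:
            out += code.to_bytes(2, "big")
    return bytes(out)


if __name__ == "__main__":
    to_mb, from_mb = build_tables()
    print(ucs2_to_multibyte("\u4e2d\u6587.txt".encode("utf-16-be"), to_mb))
```

Note that the output length differs from the input length, which is the
property that upsets anything relying on stable directory offsets.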

-- Terry
