Date:      Fri, 28 Aug 1998 01:42:55 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        stb@hanse.de (Stefan Bethke)
Cc:        tlambert@primenet.com, archie@whistle.com, freebsd-hackers@FreeBSD.ORG
Subject:   Re: Warning: Change to netatalk's file name handling
Message-ID:  <199808280142.SAA06707@usr07.primenet.com>
In-Reply-To: <Pine.BSF.3.96.980828001312.24324C-100000@transit.hanse.de> from "Stefan Bethke" at Aug 28, 98 00:36:04 am

> > > Where is the encoding defined for character values in the ranges between
> > > \0x01 to \0x1f, and \0x7f to \0xff in terms of UFS, POSIX, whatever?
> > 
> > ISO 8859?
> 
> Is this a standardized encoding for POSIX file names, or just a
> convention? If it is only a convention, what will non-Latin-script users
> think about it? How do we discriminate between different 8859 encodings?
> (Yeah, I see your point about "locales".)

The UNIX server does *not* discriminate.  The UNIX server (naively)
expects the locale to be set to the correct value for the data
that is to be viewed.  In other words, it expects that the user
knows what they are doing.

Ideally, at some point, we will cut over to a character encoding
standard that doesn't force a server (acting as a user of the UNIX
system) to choose a single locale for all clients of the server.

For example, Unicode.

This will require that we provide a wchar_t interface to the named
objects in the file system, and will require expanding the directory
block size to 1k, at a minimum, to support 256 character path components,
where a character is 2 octets (ideally) or 4 octets (to satisfy people
who optimistically believe that the other code pages in ISO 10646 will
be defined so they can "grep -v" to avoid seeing text not in their
own language).

Of course, for NFS interoperability with existing systems, you will
need to "attribute" NFS mounts of legacy servers with an 8-bit
character encoding so it can be round-tripped in and out of Unicode.


So the short answer is "character = single octet, encoding undefined"
for now and "character = multiple octet, encoding undefined" for later.


> > Per interoperability: This presumes, incorrectly, that Mac's support
> > the same idiotic idea of code pages as SAMBA must.
> 
> Macs, in this sense, use a single "code page." I believe there is an escape
> mechanism to change the encoding to non-latin scripts, but I will have to
> look that up in Inside Mac. For AFP 2.1 (which netatalk claims to support to
> the extent the Macs use it), there is a single encoding defined, without any
> escape mechanism.

Escaping mechanisms are "in-band".  This is why ISO-2022 encoded
Japanese looked the same between SAMBA and AppleTalk for Archie.


> > > It won't change anything to the worse; the only problem is that
> > > existing files with file names containing control characters
> > > (custom icons on folders being the single source of such name
> > > probably) will stop working and will need manual assistance from
> > > an operator.
> > 
> > It will break a number of things.  It already breaks the file name
> > length limitation in SAMBA.  Duplicating this break into Appletalk is,
> > IMO, a bad idea.
> 
> I don't know much about SMB/CIFS/Samba. What is the filename length limit
> (as opposed, possibly, to the pathname limit)?

256 characters.  On UNIX, it is likewise 256 characters, so the two
limits line up with no slack for escape sequences.

Any escaping in-band steals characters from the end of the name, and
reduces the possible length for files with escaped characters in them.

This is bad.


> > If you are going to push this hard, you should consider international
> > representation of file names by client locale, and how it is already
> > handled.
> 
> Would you mind pointing me to any information shedding light on
> standardisation efforts for file name representation? In terms of "locale",
> this would mean that "Mac" or "AFP" would be its own locale in terms of
> file name character encoding?

Well, the generic references are ISO 2022, ISO 10646 (or the Unicode
Standard), EUC, and RFC 2152.

I would go light on RFC 2152; UTF-7, which is what it documents, is
another encoding mechanism that destroys the ability to predict
the buffer size necessary to back fixed fields.  Such standards
are evil, especially since we are stuck with using fixed maximums
for the number of characters in a filename.

ISO 2022 has the same problem, of course, but with each character
representing an ideogram instead of an alphabetic letter, information
density is high enough that a 3:1 fixed length limit is not much
of a hardship.


> After all, I see three possible ways:
> 
> - improve interoperability by confining to printable ASCII (or ISO-8859-1,
>   or...) and not escaping other glyphs, thus breaking AFP conformance;

Don't break AFP conformance.

> - escaping all glyphs (or rather their encoding) in a way that preserves the
>   full AFP filename encoding space (for filenames, this is 0x01 to 0xff,
>   with ":" being illegal as it is the path delimiter), but using printable
>   ASCII where possible (this is, I believe, what netatalk tries to do, but
>   doesn't, due to a stupid bug).

The "/"-for-":" trade ("/" is the volume delimiter; this was the info
I missed earlier) is a relatively easy call: it's simple and obvious,
and you are exporting by volume, so a volume separator is not needed
in-band.  So this is an easy fix for an easy bug.


> - translate the AFP filename encoding space into some larger glyph encoding
>   space, such as Unicode, or, more specifically, UTF-8.

This is evil for the reasons outlined above.


> The last one probably is the way to go, but this would require (at least to
> me) some testimonial that Unicode in general and UTF-8 in particular is the
> way to go for file names in FreeBSD. This of course would probably start
> other interop problems with NFS and alike, and it would require samba to
> deal with CP bogosities in its own right instead of putting it in the face
> of every other app.

People from the US will like UTF-8, since it leaves ASCII alone.  It
pretty much screws everyone else in the 0x80-0xff space (i.e., all the
nations using the ISO 8859 family, KOI8-R, etc., as well as the shift
encodings used in Japan/Taiwan/China/Korea), though, because it means
all of their existing data would need to be translated.

This is exactly the problem of trying to move from "coding undefined"
to "coding defined".  This is why Unicode, which leaves the
interpretation of the locale to the rendering engine's choice of
fonts, is so UNIXy.

I really can't see people doing this, or worse, us doing a very
complicated version of this in the kernel, in order to be able
to use existing data and NFS mounts from legacy systems.

For some educational uses in Europe, you would be asking them to
convert their entire country simultaneously.  In Japan, you would
be asking, minimally, for institutional conversion.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.



