Date:      Wed, 6 May 2009 13:43:05 -0400
From:      Garrett Wollman <wollman@csail.mit.edu>
To:        Oliver Fromme <olli@lurza.secnetix.de>
Cc:        freebsd-standards@freebsd.org, juli@clockworksquid.com
Subject:   Re: Shouldn't cat(1) use the C locale?
Message-ID:  <18945.52265.44038.498643@khavrinen.csail.mit.edu>
In-Reply-To: <200905061707.n46H7jqs042942@lurza.secnetix.de>
References:  <18945.44648.875780.605560@khavrinen.csail.mit.edu> <200905061707.n46H7jqs042942@lurza.secnetix.de>

<<On Wed, 6 May 2009 19:07:45 +0200 (CEST), Oliver Fromme <olli@lurza.secnetix.de> said:

> Normally cat is agnostic of the encoding of its input data,
> because it is handled like binary data.  But if the -v
> option is used, it has to actually look at the data in
> order to decide what is printable and what is not.
> This has two consequences:  First, it has to know the
> encoding of the input, and second, it has to know what
> is considered "printable".

I think that should be fairly obvious: the input is a stream of bytes,
which may or may not encode characters in any locale.

> The same is true for binary files.  For example, if you have
> a binary with embedded ISO8859 strings that you want to display
> on a UTF8 terminal, then the following works:
> LC_CTYPE=en_US.ISO8859-1 cat -v file | recode iso8859-1..utf8
> It correctly displays German Umlauts and some other characters,
> but escapes 8bit characters that are non-printable in the
> ISO8859-1 locale.

Now try the same thing on a binary with UTF-8 strings in it.
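
One way to see the problem is to model only the decoding mismatch. The sketch below is an illustration, not cat(1) itself: it assumes the bytes in the file are UTF-8, reads them as if each byte were one ISO8859-1 character, and then performs the equivalent of the recode step. The sample string is made up for the example.

```python
# Sketch: UTF-8 bytes interpreted as ISO8859-1, then recoded to UTF-8.
# Models the encoding mismatch only; it does not reproduce cat -v escaping.
original = "Grüße"                      # hypothetical embedded string
utf8_bytes = original.encode("utf-8")   # ü -> C3 BC, ß -> C3 9F

# An ISO8859-1 reader sees one character per byte.
as_latin1 = utf8_bytes.decode("iso8859-1")

# Equivalent of `recode iso8859-1..utf8` applied to that interpretation.
recoded = as_latin1.encode("utf-8")

# Each two-byte UTF-8 character has become two unrelated characters.
print(len(original), len(as_latin1))
print(ascii(as_latin1))
print(ascii(recoded.decode("utf-8")))
```

Note that the second byte of "ß" (0x9F) falls in the ISO8859-1 C1 control range, so a per-byte printability test would treat it differently from the second byte of "ü" (0xBC), even though both are halves of ordinary letters.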

(UTF-8 at least gives you a validity constraint on possible multibyte
characters, which arbitrary multibyte encodings do not necessarily
provide.  This mitigates the "reading frame" problem, because the
first byte of an actual UTF-8 character cannot be the n'th byte
(n > 1) of any UTF-8 character.)
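
The property in the parenthetical can be sketched in a few lines. This is a minimal illustration, assuming well-formed UTF-8 input; the function names are hypothetical. Continuation bytes always have the bit pattern 10xxxxxx and lead bytes never do, so a reader that lands in the middle of a character can skip forward to the next character boundary.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def resync(data: bytes, offset: int) -> int:
    """Return the first index >= offset that is not a continuation byte."""
    while offset < len(data) and is_continuation(data[offset]):
        offset += 1
    return offset


# Characters of 1, 2, 3, 4 and 1 bytes, starting at indices 0, 1, 3, 6, 10.
text = "aä€😀z".encode("utf-8")

# Start in the middle of the 3-byte character and find the next boundary.
start = resync(text, 4)
print(start, text[start:].decode("utf-8"))
```

In arbitrary binary data a byte that looks like a lead byte need not begin a real character, so this only mitigates the problem; a decoder still has to check that the bytes following it form a valid sequence.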

-GAWollman


