Date:      Thu, 11 Apr 2002 14:52:24 -0400
From:      "Dan Langille" <dan@langille.org>
To:        freebsd-chat@freebsd.org
Subject:   CVS log encoding (was Re: what are these characters please?)
Message-ID:  <20020411185322.E34203F30@bast.unixathome.org>
In-Reply-To: <a947md$2fli$1@kemoauc.mips.inka.de>

I have combined your two replies into one message.

On 11 Apr 2002 at 14:47, Christian Weisgerber wrote:

> Dan Langille <dan@langille.org> wrote:
> 
> > > Well what encoding do your XML documents use?
> > 
> > It was UTF-8.  Some months ago it changed to ISO-8859-1 when I first
> > encountered this type of issue (back then it was Lyngb<F8>l).
> 
> Seems like a bad choice to me, because how are you now going to
> handle characters outside the meager repertoire of ISO 8859-1?
> 
> > Given that the incoming characters are supposed to be ISO-8859-1 (which
> > is what CVS stores (see Tony's message),
>                        Terry
> This is wrong. CVS stores byte streams. There is no implied character
> set. Nor is there a way to tag any data or CVS meta data with a
> character set.

[sorry Terry; I worked with a chap in New Zealand by the name of Tony 
Lamberton and whenever I see your name...]

> You can _by convention_ decide that all data stored in a particular
> CVS repository is to be interpreted in the <mumble> character set,
> but I'm not aware of such a convention being in place for FreeBSD.

If there is no convention, then it will be up to me to pick an encoding 
and stick with it.

> > I'm quite sure the best thing to do is just ignore the non-standard
> > characters (i.e. by removing them).  What's your view on that approach?
> 
> I still don't know quite what you are trying to accomplish.  Are
> you looking for a purely mechanical solution?  Or are you prepared
> to do manual fix-ups?  Do you strive for accuracy?  Or do you only want
> to quickly crunch data and don't care if people's names are mutilated?

The goal is to accurately reflect the cvs log (see 
http://test.freshports.org for the beta set).  But since I've started to 
encounter these characters which are causing strife, I'm willing to take 
what I can get.

> Since CVS doesn't store character set information, anything outside
> the printable ASCII range (0x20..0x7E) is *undefined* and thus
> basically an error condition.  There are two ways to deal with this:
> 
> 1. You can just automatically strip the characters (or replace them
>    by a placeholder like '?' or such) and get on.  This will mutilate
>    some names, but since the input is already undefined, you can
>    argue that you really won't do any further damage anyway.
>
> 2. You can manually try to figure out what those characters are and
>    fix them up in one of several ways: replace by UTF-8, convert
>    to ASCII-only, etc.

I like a combination of the two:

 - Fix any characters which are outside the chosen encoding and save the
   data immediately.  Flag the record as having been altered.
 - Optionally fix flagged records at some future date

This will achieve the primary goal of always having up-to-date information 
and [optionally] achieve a not-so-primary goal of having accurate data.
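
A minimal sketch of that combined approach, written in Python rather than
Perl purely for illustration.  It assumes the "chosen encoding" is tab plus
printable ASCII and that '?' is the placeholder; the function name
sanitize_log_line and the "altered" flag are hypothetical, not anything
FreshPorts is known to use.

```python
from typing import Tuple


def sanitize_log_line(raw: bytes, placeholder: str = "?") -> Tuple[str, bool]:
    """Replace bytes outside tab/printable ASCII and report whether any were replaced."""
    out = []
    altered = False
    for b in raw:
        if b == 0x09 or 0x20 <= b <= 0x7E:
            out.append(chr(b))       # safe byte: keep as-is
        else:
            out.append(placeholder)  # undefined byte: substitute
            altered = True           # flag the record for later manual fix-up
    return "".join(out), altered


text, altered = sanitize_log_line(b"Slaven Rezi\xe6")
record = {"text": text, "altered": altered}  # hypothetical record layout
print(record)
```

Records with "altered" set can be saved immediately and revisited later,
which keeps the data current without blocking on manual corrections.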

> If you go with (1), I strongly suggest that you kill everything
> outside ASCII and do not consider the input to be ISO 8859-1.
> Grepping over the FreeBSD commit logs, I see names that, although
> technically valid ISO 8859-1 sequences, were clearly input in ISO
> 8859-2 or KOI-8R environments.

Thank you for grepping those logs for me.  It would be good if we could 
have one encoding which covers all possible characters.  I think I'll 
settle for the UTF-8 encoding (unless you can recommend another).

On 11 Apr 2002 at 15:11, Christian Weisgerber wrote:

> Dan Langille <dan@langille.org> wrote:
> 
> > I have to find a solution as non-ISO-8859-1 characters are causing
> > grief when it comes to reading in the XML.  See below.
> 
> Note that there is stuff in the commit logs that is valid but doesn't
> make sense in ISO 8859-1 encoding.  For example, somebody by the
> name of "Slaven Rezi<E6>" is credited.  I very much doubt that the
> final character is really ae ligature (as per 8859-1); c with acute
> (8859-2) seems more plausible.  It gets worse for Cyrillic names.

I'm beginning to see the extent of the problem.
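
The ambiguity is easy to reproduce.  The following is only a sketch, in
Python rather than Perl, that decodes the same bytes under three of the
encodings mentioned in this thread; the variable names are illustrative.

```python
raw = b"Slaven Rezi\xe6"  # the bytes as they might sit in a commit log

# The same byte 0xE6 maps to a different character in each encoding.
decoded = {enc: raw.decode(enc) for enc in ("iso-8859-1", "iso-8859-2", "koi8-r")}

for enc, text in decoded.items():
    print(enc, text)
```

Every one of these decodings succeeds without error, so nothing in the data
itself says which one the committer intended.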

> So if you assume the input to be ISO-8859-1-encoded, you will
> preserve the stuff that was actually input in 8859-1 but totally
> screw up the stuff that was originally input in some other encoding.

That points at using something like UTF-8 I think.
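
One caveat, sketched in Python for illustration: storing UTF-8 does not by
itself say what the incoming bytes mean.  A strict UTF-8 decode can at
least detect input that is not already UTF-8, so it can be flagged.  The
helper name decode_if_utf8 is hypothetical.

```python
from typing import Optional


def decode_if_utf8(raw: bytes) -> Optional[str]:
    """Return the decoded text if raw is well-formed UTF-8, otherwise None."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


print(decode_if_utf8(b"plain ASCII is valid UTF-8"))
print(decode_if_utf8(b"Slaven Rezi\xe6"))  # a lone 0xE6 is not valid UTF-8
```

Input that fails this check still needs its original encoding guessed or
fixed by hand before it can be converted.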

> > I'm not at all worried about restoring the original text.  I'm going 
> > for an "ignore what I can't use" solution.
> 
> Okay.
> 
> > I think my goal here is to remove all non-ISO-8859-1 characters from
> > the incoming cvs-all message.
> 
> It makes more sense to clobber everything that isn't ASCII.
> 
> chomp($line);
> $line =~ tr/\x09\x20-\x7E/?/c;	# keep tab and printable ASCII, replace the rest with '?'
> 
> Putting a replacement character such as '?' or '#' there is probably
> less confusing than outright deleting the offending bytes.

Good point.  That will ease the manual fix-up process too.

-- 
Dan Langille
The FreeBSD Diary - http://freebsddiary.org/ - practical examples

