Date:      Thu, 11 Apr 2002 15:11:13 +0000 (UTC)
From:      naddy@mips.inka.de (Christian Weisgerber)
To:        freebsd-chat@freebsd.org
Subject:   Re: what are these characters please?
Message-ID:  <a9492h$2g43$1@kemoauc.mips.inka.de>
References:  <3CB571D6.2C10B9AA@mindspring.com> <20020411113858.E48BB3F30@bast.unixathome.org>

Dan Langille <dan@langille.org> wrote:

> I have to find a solution as non-ISO-8859-1 characters are causing grief
> when it comes to reading in the XML.  See below.

Note that there is stuff in the commit logs that is valid but doesn't
make sense in ISO 8859-1 encoding.  For example, somebody by the
name of "Slaven Rezi<E6>" is credited.  I very much doubt that the
final character is really ae ligature (as per 8859-1); c with acute
(8859-2) seems more plausible.  It gets worse for Cyrillic names.

So if you assume the input to be ISO-8859-1-encoded, you will
preserve the stuff that was actually input in 8859-1 but totally
screw up the stuff that was originally input in some other encoding.
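
To illustrate the ambiguity, here is a minimal sketch that decodes the
byte 0xE6 under both encodings.  It assumes a Perl with the Encode module
available; the helper name codepoint() is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The single byte that follows "Slaven Rezi" in the commit log.
my $byte = "\xE6";

# Decode the same byte under two different assumed encodings.
my $as_latin1 = decode('iso-8859-1', $byte);
my $as_latin2 = decode('iso-8859-2', $byte);

# Hypothetical helper: format a character as a Unicode code point.
sub codepoint { return sprintf('U+%04X', ord($_[0])) }

printf "ISO 8859-1: %s\n", codepoint($as_latin1);  # U+00E6, ae ligature
printf "ISO 8859-2: %s\n", codepoint($as_latin2);  # U+0107, c with acute
```

The byte is identical in both cases; only the assumed encoding decides
which character comes out.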

> I'm not at all worried about restoring the original text.  I'm going for a 
> "ignore what I can't use"-solution.

Okay.

> I think my goal here is to remove all non-ISO-8859-1 characters from the
> incoming cvs-all message.

It makes more sense to clobber everything that isn't ASCII.

chomp($line);
$line =~ tr/\x09\x20-\x7E/?/c;	# keep tab and printable ASCII, replace the rest

Putting a replacement character such as '?' or '#' there is probably
less confusing than outright deleting the offending bytes.
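
A rough sketch comparing the two options, replacing versus deleting.
The sub names are made up for illustration; the character class is the
same one used in the one-liner above.

```perl
use strict;
use warnings;

# Hypothetical helper: turn every byte outside tab/printable ASCII into '?'.
sub replace_non_ascii {
    my ($line) = @_;
    $line =~ tr/\x09\x20-\x7E/?/c;    # /c complements the search list
    return $line;
}

# Hypothetical helper: drop those bytes entirely instead.
sub delete_non_ascii {
    my ($line) = @_;
    $line =~ tr/\x09\x20-\x7E//cd;    # /d deletes instead of replacing
    return $line;
}

my $line = "Slaven Rezi\xE6";
print replace_non_ascii($line), "\n";    # Slaven Rezi?
print delete_non_ascii($line), "\n";     # Slaven Rezi
```

With the replacement character, a reader can at least see that something
was there; the deleted version silently looks like a different name.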

-- 
Christian "naddy" Weisgerber                          naddy@mips.inka.de

