Date:      Thu, 11 Apr 2002 13:39:55 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Christian Weisgerber <naddy@mips.inka.de>
Cc:        freebsd-chat@freebsd.org
Subject:   Re: what are these characters please?
Message-ID:  <3CB5F49B.C21B24E9@mindspring.com>
References:  <3CB571D6.2C10B9AA@mindspring.com> <20020411113858.E48BB3F30@bast.unixathome.org> <a9492h$2g43$1@kemoauc.mips.inka.de>

Christian Weisgerber wrote:
> > I think my goal here is remove all non-ISO-8859-1 characters from the
> > incoming cvs-all message.
> 
> It makes more sense to clobber everything that isn't ASCII.
> 
> chomp($line);
> $line =~ tr/\x09\x20-\x7E/?/c;  # tab, printable ASCII
> 
> Putting a replacement character such as '?' or '#' there is probably
> less confusing than outright deleting the offending bytes.

In this case, it's probably an ISO 2022 based EUC encoding for
JIS-208, so it's not going to be relevant anyway, since what
has to be replaced is a character set change sequence, a
character, and a change back.

In this particular case, the advice about non-printable ASCII
characters doesn't work, either, since it will only swallow the
<ESC>, and not the rest of the sequence or the terminator.

Living with it -- or stripping the control characters -- is
probably the only thing that will work.

The character set encoding information was lost when the
cut-and-paste happened (this is a good argument for Unicode,
*NOT* UTF-8, and 16 bit wchar_t).

In this case, stripping the whole escape sequences leaves a "d",
and stripping only the non-printable ISO-8859-1 or ASCII
characters leaves a ",Ad(B".
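
To make the difference concrete, here is a minimal sketch in Python
(not the Perl of the filter under discussion). It assumes the
original bytes were <ESC> , A d <ESC> ( B, and that an escape
sequence has the ISO 2022 shape: ESC, zero or more intermediate
bytes 0x20-0x2F, then one final byte 0x30-0x7E. The function names
are made up for illustration.

```python
import re

ESC = "\x1b"

# Assumed sample: switch character set, one character, switch back to ASCII.
sample = ESC + ",A" + "d" + ESC + "(B"

# Anything that is not a tab or printable ASCII (same set as the tr/// above).
NON_PRINTABLE = re.compile(r"[^\t\x20-\x7e]")

# Assumed ISO 2022 escape sequence shape: ESC, intermediates, final byte.
ISO2022_ESCAPE = re.compile(r"\x1b[\x20-\x2f]*[\x30-\x7e]")


def replace_non_printable(line, repl="?"):
    """Replace each non-printable character; only <ESC> is affected here."""
    return NON_PRINTABLE.sub(repl, line)


def delete_non_printable(line):
    """Delete each non-printable character; the rest of the sequence stays."""
    return NON_PRINTABLE.sub("", line)


def strip_escape_sequences(line):
    """Delete whole escape sequences, including intermediate and final bytes."""
    return ISO2022_ESCAPE.sub("", line)
```

On the assumed sample, replace_non_printable() gives "?,Ad?(B",
delete_non_printable() gives ",Ad(B", and strip_escape_sequences()
gives "d". None of the three recovers the character that was
intended, since the "d" only means something in the character set
the first sequence selected.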

-- Terry

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-chat" in the body of the message