Message-ID: <3CB5F49B.C21B24E9@mindspring.com>
Date: Thu, 11 Apr 2002 13:39:55 -0700
From: Terry Lambert
To: Christian Weisgerber
Cc: freebsd-chat@freebsd.org
Subject: Re: what are these characters please?

Christian Weisgerber wrote:
> > I think my goal here is to remove all non-ISO-8859-1 characters
> > from the incoming cvs-all message.
>
> It makes more sense to clobber everything that isn't ASCII.
>
>     chomp($line);
>     $line =~ tr/\x09\x20-\x7E/?/c;    # tab, printable ASCII
>
> Putting a replacement character such as '?' or '#' there is probably
> less confusing than outright deleting the offending bytes.

In this case, it's probably ISO 2022 based EUC encoding for JIS-208,
so it's not going to be relevant anyway, since what has to be replaced
is a character set change sequence, a character, and a change back.

In this particular case, the advice about non-printable ASCII
characters doesn't work, either, since it will only swallow the ESC,
and not the rest of the sequence or the terminator.

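A minimal sketch of that difference, in Python for illustration. It
assumes the offending bytes were ESC , A d ESC ( B, which is only an
inference from the ",Ad(B" residue mentioned in this message; the
sample string and the function names are hypothetical. It treats an
escape sequence as ESC, any intermediate bytes (0x20-0x2F), and one
final byte (0x30-0x7E), and compares that with a per-byte filter.

```python
import re

# Hypothetical sample, assumed from the ",Ad(B" residue:
# ESC , A   d   ESC ( B
SAMPLE = b"\x1b,Ad\x1b(B"

# ISO 2022 style escape sequence: ESC, zero or more intermediate
# bytes (0x20-0x2F), then exactly one final byte (0x30-0x7E).
ESCAPE_SEQUENCE = re.compile(rb"\x1b[\x20-\x2f]*[\x30-\x7e]")

# Any byte that is neither tab nor printable ASCII.
NOT_PRINTABLE = re.compile(rb"[^\x09\x20-\x7e]")


def replace_nonprintable(data: bytes, repl: bytes = b"?") -> bytes:
    """Per-byte filter: swap each non-printable byte for repl."""
    return NOT_PRINTABLE.sub(repl, data)


def delete_nonprintable(data: bytes) -> bytes:
    """Per-byte filter: drop each non-printable byte."""
    return NOT_PRINTABLE.sub(b"", data)


def strip_escape_sequences(data: bytes) -> bytes:
    """Remove whole escape sequences, not just the ESC byte."""
    return ESCAPE_SEQUENCE.sub(b"", data)


if __name__ == "__main__":
    print(replace_nonprintable(SAMPLE))
    print(delete_nonprintable(SAMPLE))
    print(strip_escape_sequences(SAMPLE))
```

The per-byte filters only ever touch the ESC itself, because the bytes
that follow it in the sequence are ordinary printable ASCII.
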
Living with it -- or stripping the control characters -- is probably
the only thing that will work.  The character set encoding information
was lost when the cut-and-paste happened (this is a good argument for
Unicode, *NOT* UTF-8, and a 16-bit wchar_t).

In this case, stripping the escape sequences leaves a "d", and
stripping only the non-printable ISO-8859-1 or ASCII characters leaves
",Ad(B".

-- Terry