Date:      Thu, 11 Apr 2002 08:38:22 -0400
From:      "Dan Langille" <dan@langille.org>
To:        Terry Lambert <tlambert2@mindspring.com>
Cc:        chat@freebsd.org
Subject:   Re: what are these characters please?
Message-ID:  <20020411123917.6F2B93F30@bast.unixathome.org>
In-Reply-To: <3CB571D6.2C10B9AA@mindspring.com>

On 11 Apr 2002 at 4:21, Terry Lambert wrote:

> The character sets selected are documented in ANSI 3.64; you can
> also find them in the VT220 and VT320 programming guides.  Given
> that the committer was likely using EUC encoding for JIS-208, it
> seems unrecoverable.
> 
> Most likely, you are going to have to live with it.

I think I'll just remove the "offending" characters.  I've found two 
solutions, each of which produces the same result:

$ tr -d '\001'-'\011''\013''\014''\016'-'\037''\200'-'\377' < xml.txt > xml3.txt
$ diff xml3.txt xml.txt
14c14
<         [Submitted by: Ville Skytt,Ad(B &lt;ville.skytta@iki.fi&gt;]
---
>         [Submitted by: Ville Skyttd &lt;ville.skytta@iki.fi&gt;]

$ cat xml.txt | sed -e 's/[^ -~][^ -~]*//g' > xml5.txt
$ diff xml5.txt xml.txt
14c14
<         [Submitted by: Ville Skytt,Ad(B &lt;ville.skytta@iki.fi&gt;]
---
>         [Submitted by: Ville Skyttd &lt;ville.skytta@iki.fi&gt;]

I think I'll go with the sed regex above and add it to my Perl script.
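
For reference, a minimal sketch of that sed filter wrapped in a shell
function. The name strip_nonprintable and the sample input are hypothetical,
not taken from xml.txt. LC_ALL=C is an added assumption: it should make sed
compare bytes rather than multibyte characters, so bytes 0x80-0xFF are
matched by the bracket expression regardless of the user's locale.

```shell
# Hypothetical helper (name is illustrative): keep only printable ASCII,
# space (0x20) through tilde (0x7E), on every input line.
# LC_ALL=C is an assumption added here so that sed works byte by byte.
strip_nonprintable() {
    LC_ALL=C sed -e 's/[^ -~][^ -~]*//g'
}

# Made-up sample line modelled on the diff output; \033 is the ESC byte
# that introduces each character-set switch.
printf 'Ville Skytt\033,Ad\033(B\n' | strip_nonprintable
```

Only the ESC bytes are dropped; the printable remainder of each escape
sequence (",A" and "(B" here) stays in the output. The two commands are not
byte-for-byte equivalent in general: the tr list leaves \015 (CR) and \177
(DEL) in place, while the sed expression removes them, so they agree only on
input that contains neither.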

Does anyone have any suggestions?

Tony: my thanks for your replies.  They have been useful in understanding the 
problem.


-- 
Dan Langille
The FreeBSD Diary - http://freebsddiary.org/ - practical examples





