Date: Thu, 11 Apr 2002 08:38:22 -0400
From: "Dan Langille" <dan@langille.org>
To: Terry Lambert <tlambert2@mindspring.com>
Cc: chat@freebsd.org
Subject: Re: what are these characters please?
Message-ID: <20020411123917.6F2B93F30@bast.unixathome.org>
In-Reply-To: <3CB571D6.2C10B9AA@mindspring.com>

On 11 Apr 2002 at 4:21, Terry Lambert wrote:

> The character sets selected are documented in ANSI 3.64; you can
> also find them in the VT220 and VT320 programming guides. Given
> that the committer was likely using EUC encoding for JIS-208, it
> seems unrecoverable.
>
> Most likely, you are going to have to live with it.

I think I'll just remove the "offending" characters. I've found two
solutions, both of which produce the same result on this file:

$ tr -d '\001'-'\011''\013''\014''\016'-'\037''\200'-'\377' < xml.txt > xml3.txt
$ diff xml3.txt xml.txt
14c14
< [Submitted by: Ville Skytt,Ad(B <ville.skytta@iki.fi>]
---
> [Submitted by: Ville Skyttd <ville.skytta@iki.fi>]

$ cat xml.txt | sed -e 's/[^ -~][^ -~]*//g' > xml5.txt
$ diff xml5.txt xml.txt
14c14
< [Submitted by: Ville Skytt,Ad(B <ville.skytta@iki.fi>]
---
> [Submitted by: Ville Skyttd <ville.skytta@iki.fi>]

I think I'll go with the above regex and add it to my perl script.
Does anyone have any suggestions?

Tony: my thanks for your replies. They have been useful in
understanding the problem.

--
Dan Langille
The FreeBSD Diary - http://freebsddiary.org/ - practical examples

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-chat" in the body of the message
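
For anyone comparing the two filters quoted above, here is a minimal sketch that runs both on a made-up sample line rather than on the original xml.txt. The function names strip_tr and strip_sed and the sample strings are hypothetical, and LC_ALL=C is an assumption added so that both tools match byte by byte regardless of the caller's locale. The two filters agree on input like the line in the diff, but they are not equivalent in general: the tr set leaves NUL, CR and DEL alone, while the sed expression removes every byte outside the range space through tilde, including tab, CR and DEL.

```sh
#!/bin/sh
# Sketch only: strip_tr and strip_sed are illustrative names, not from the thread.

# Variant 1: delete control bytes \001-\011, \013, \014, \016-\037
# and all 8-bit bytes \200-\377. NUL, LF, CR and DEL are kept.
strip_tr() {
    LC_ALL=C tr -d '\001-\011\013\014\016-\037\200-\377'
}

# Variant 2: delete every run of bytes outside printable ASCII (space to tilde).
strip_sed() {
    LC_ALL=C sed -e 's/[^ -~][^ -~]*//g'
}

# Hypothetical sample resembling the damaged line: ESC , A d ESC ( B
out_tr=$(printf 'Ville Skytt\033,Ad\033(B <ville.skytta@iki.fi>\n' | strip_tr)
out_sed=$(printf 'Ville Skytt\033,Ad\033(B <ville.skytta@iki.fi>\n' | strip_sed)

# A carriage return shows where the two variants differ.
cr_tr=$(printf 'a\rb\n' | strip_tr)
cr_sed=$(printf 'a\rb\n' | strip_sed)

printf '%s\n' "$out_tr"
printf '%s\n' "$out_sed"
```

Only the ESC bytes are removed in both cases, so the printable remainder of the escape sequences (",Ad(B") stays in the output, which matches what the diffs above show. Removing that remainder too would need a pattern that matches the whole escape sequence, not just its non-printable bytes.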