Date:      Mon, 15 Feb 1999 20:14:40 -0500
From:      "Chuck O'Donnell" <cao@bus.net>
To:        Sue Blake <sue@welearn.com.au>
Cc:        freebsd-questions@FreeBSD.ORG
Subject:   Re: cleaning a text file
Message-ID:  <19990215201440.A11649@milf18.bus.net>
In-Reply-To: <19990216103740.60271@welearn.com.au>; from Sue Blake on Tue, Feb 16, 1999 at 10:37:40AM +1100
References:  <19990215201056.19929@welearn.com.au> <Pine.BSF.3.91.990215010943.20451F-100000@dsinw.com> <19990216095232.J2207@lemis.com> <19990216103740.60271@welearn.com.au>

On Tue, Feb 16, 1999 at 10:37:40AM +1100, Sue Blake wrote:
> On Tue, Feb 16, 1999 at 09:52:32AM +1030, Greg Lehey wrote:
> > On Monday, 15 February 1999 at  1:10:36 -0800, rick hamell wrote:
> > >
> > >> Also, this file has some very long lines which would get truncated
> > >> or unexpectedly wrapped when sent as email. And if there is something
> > >> strange, I have to read it and guess what it should have been.
> > >>
> > >> Maybe someone will come up with something for this particular case.
> > >> I can't believe there's not some little utility for this that's been
> > >> hanging around unloved for years.
> > >
> > > 	Oy! Ok... how does Greg reformat all those emails?
> > 
> > With Emacs.  I have a collection of macros which I'm constantly
> > changing to catch up with new tricks that mailers discover.
> > 
> > To Sue's original question: it depends on what your text looks like.
> > tr(1) will remove characters if you ask it to.
> 
> If I knew which characters were there (so I could ask tr to remove
> them) I would have already removed them with my text editor.
> 
> >  fmt(1) might be useful for wrapping lines.
> 
> I don't see the long line lengths as a big problem at this stage, but
> fmt might be useful later.
> 
> The problem is that I don't know which funny characters exist in the
> file, if any. I want to find out what they are, so I can search for
> them and eyeball them before killing them.
> 
> 
> Just knowing which characters they are would give me many solutions
> immediately. There still doesn't seem to be a way to find this out :-(
> 
> Maybe there's a long way... somehow put a linefeed after each character
> in the file (with sed?) and then sort it and look at the top and bottom
> of the sorted file.
> 
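
The split-and-sort idea should work. A rough sketch of the same thing
in Perl, which tallies every byte and lists the distinct values in
numeric order, so control characters and 8-bit characters end up at the
top and bottom of the listing. The helper name tally_bytes and the
output layout are made up for illustration, and it only reads files
named on the command line:

```perl
#!/usr/local/bin/perl
# Tally every byte in the named files and list the distinct values.
use strict;
use warnings;

# Hypothetical helper: return a hash ref mapping byte value => count.
sub tally_bytes {
    my ($text) = @_;
    my %count;
    $count{ ord($_) }++ for split //, $text;
    return \%count;
}

my %total;
for my $file (@ARGV) {
    open(my $fh, '<', $file) or die "$file: $!";
    binmode $fh;    # count raw bytes, no newline translation
    while (my $line = <$fh>) {
        my $c = tally_bytes($line);
        $total{$_} += $c->{$_} for keys %$c;
    }
    close $fh;
}

# One row per distinct byte: decimal, hex, the character itself
# (or '.' if it is not printable ASCII), and how often it occurs.
for my $v (sort { $a <=> $b } keys %total) {
    my $shown = ($v >= 33 && $v <= 126) ? chr($v) : '.';
    printf "%3d 0x%02x %s %d\n", $v, $v, $shown, $total{$v};
}
```

Anything outside 32-126 other than tab (9), newline (10) and perhaps
carriage return (13) is probably worth a closer look.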

If you just want to find funny chars, how about:

---------------
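#!/usr/local/bin/perl
# Report every character that is not a word character, whitespace,
# or one of the punctuation characters listed in $reg.

require 5;
use strict;

# Negated class: matches anything outside the accepted set.
my $reg = '[^\w\s\$#\@!\`\~\%\^\&\*\(\)+=\|\\\?\<\>,.\/"\':;\{\}\[\]-]';

while (<>) {
    # Copy of the line with each flagged character shown as '?'.
    (my $s = $_) =~ s/$reg/?/og;
    chomp $s;    # the last line may lack a newline
    while (m/($reg)/og) {
        my $p = pos() - 1;    # zero-based column of the match
        my $c = ord $1;       # decimal value of the character
        printf "%s\n%s^ L%d C%d\n", $s, " " x $p, $., $c;
    }
}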
---------------

Any character not accepted by $reg is marked with a '^' and shown as a
'?' char. `L' is the line number and `C' is the decimal value of the
character. You could probably fix it so it does the right thing on
long lines.
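
For the long lines, one possibility is to print only a window of
context around each hit instead of the whole line. This is an untested
sketch under a few assumptions: $WINDOW, the helper name
report_odd_chars and the extra `col' field are invented here, and the
character class is simplified to "anything outside printable ASCII,
tab, CR and newline" rather than the exact list in $reg. It reads only
files named on the command line:

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

my $WINDOW = 30;    # assumed: characters of context on each side

# Assumed simplification of $reg: anything outside printable ASCII,
# tab, carriage return and newline counts as odd.
my $odd = qr/[^\x20-\x7e\t\r\n]/;

# Hypothetical helper: return one report string per odd character.
sub report_odd_chars {
    my ($line, $lineno) = @_;
    chomp $line;
    my @out;
    while ($line =~ /($odd)/g) {
        my $c     = ord $1;              # save before the s/// below
        my $p     = pos($line) - 1;      # zero-based column of the hit
        my $start = $p > $WINDOW ? $p - $WINDOW : 0;
        my $ctx   = substr($line, $start, 2 * $WINDOW + 1);
        $ctx =~ s/$odd/?/g;              # show odd characters as '?'
        $ctx =~ tr/\t\r/  /;             # keep the caret aligned
        push @out, sprintf("%s\n%s^ L%d col %d C%d\n",
            $ctx, " " x ($p - $start), $lineno, $p + 1, $c);
    }
    return @out;
}

for my $file (@ARGV) {
    open(my $fh, '<', $file) or die "$file: $!";
    while (my $line = <$fh>) {
        print report_odd_chars($line, $.);
    }
    close $fh;
}
```

Each hit then takes two short lines no matter how long the input line
is, and `col' gives the one-based column so you can jump to it in an
editor.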

-- 
Chuck
