Date:      Wed, 31 Dec 2008 15:20:14 -0500 (EST)
From:      vogelke+software@pobox.com (Karl Vogel)
To:        freebsd-questions@FreeBSD.ORG
Subject:   Re: well, blew it... sed or perl q again.
Message-ID:  <20081231202014.C8012BE14@kev.msw.wpafb.af.mil>
In-Reply-To: <20081230193111.GA32641@thought.org> (message from Gary Kline on Tue, 30 Dec 2008 11:31:14 -0800)

>> On Tue, 30 Dec 2008 11:31:14 -0800, 
>> Gary Kline <kline@thought.org> said:

G> The problem is that there are many, _many_ embedded "<A
G> HREF="http://whatever> Site</A> in my hundreds, or thousands, or
G> files.  I only want to delete the "http://<junkfoo.com>" lines, _not_
G> the other Href links.

   Use perl.  You'll want the "i" option to do case-insensitive matching,
   plus "s" so that "." also matches a newline and a match can span
   multiple lines ("m" only changes where ^ and $ anchor); the first
   quoted line above shows one of several places where a URL can cross
   a line-break.

   You might want to leave the originals completely alone.  I never trust
   programs to modify files in place:

     you% mkdir /tmp/work
     you% find . -type f -print | xargs grep -li http://junkfoo.com > FILES
     you% pax -rwdv -pe /tmp/work < FILES
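
   The commands above split file names on whitespace and write FILES into
   the tree being searched.  A variant that avoids both might look like the
   sketch below; list_junk_files is a made-up helper name and /tmp/FILES an
   assumed location, and it relies on find -print0 and xargs -0, which
   FreeBSD and GNU both provide.

```shell
# list_junk_files DIR: print the files under DIR that mention junkfoo.com.
# Hypothetical helper, not part of the original recipe.
list_junk_files() {
    # -print0 / -0 pass names NUL-terminated, so spaces stay intact.
    # grep -l prints matching file names, -i ignores case, -F treats the
    # pattern as a fixed string so the dots are literal.
    ( cd "$1" && find . -type f -print0 |
        xargs -0 grep -liF 'http://junkfoo.com' )
}

# Assumed usage: keep the list outside the tree, then copy as before:
#   list_junk_files . > /tmp/FILES
#   pax -rwdv -pe /tmp/work < /tmp/FILES
```

   grep still prints one name per line, so names containing newlines are
   not handled.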

   Your perl script can just read FILES and overwrite the stuff in the new
   directory.  You'll want to slurp the entire file into memory so you catch
   any URL that spans multiple lines.  Try the script below; it works for
   input like this:

      This
      <a HREF="http://junkfoo.com">
             Site</A> should go away too.

      And so should
      <a HREF=
        "http://junkfoo.com/"
      > Site</A> this

      And finally <a HREF="http://junkfoo.com/">Site</A> this

-- 
Karl Vogel                      I don't speak for the USAF or my company

The average person falls asleep in seven minutes.
                                        --item for a lull in conversation

---------------------------------------------------------------------------
#!/usr/bin/perl -w

use strict;

my $URL = 'href\s*=\s*"http://junkfoo\.com/*"';
my $contents;
my $fh;
my $infile;
my $outfile;

while (<>) {
    chomp;
    $infile = $_;

    # pax copied ./path to /tmp/work/./path, so write the result there.
    s{^\./}{/tmp/work/};
    $outfile = $_;

    open ($fh, '<', $infile) or die "$infile: $!";
    $contents = do { local $/; <$fh> };
    close ($fh);

    $contents =~ s{              # substitute ...
                    <a\b[^>]*?   # ... URL start, staying inside one tag
                    $URL         # ... actual link
                    [^>]*>       # ... rest of the opening tag
                    .*?          # ... min # of chars including newline
                    </a>         # ... until we end
                  }
                  { }gixms;      # ... with a single space

    open ($fh, '>', $outfile) or die "$outfile: $!";
    print $fh $contents;
    close ($fh);
}

exit(0);


