Date:      Tue, 15 May 2007 18:46:27 -0700
From:      Garrett Cooper <youshi10@u.washington.edu>
To:        Gary Kline <kline@tao.thought.org>
Cc:        Ian Smith <smithi@nimnet.asn.au>, freebsd-questions@freebsd.org
Subject:   Re: what's the easiest way to de-html-ize files?
Message-ID:  <464A6273.8080705@u.washington.edu>
In-Reply-To: <20070514022655.GA1304@thought.org>
References:  <20070514210933.1024A16A478@hub.freebsd.org>	<Pine.BSF.3.96.1070515152444.7949B-100000@gaia.nimnet.asn.au> <20070514022655.GA1304@thought.org>

Gary Kline wrote:
> On Tue, May 15, 2007 at 03:34:14PM +1000, Ian Smith wrote:
>> On Sat, 12 May 2007 14:34:52 -0700 Gary Kline <kline@tao.thought.org> wrote:
>>  > On Mon, May 14, 2007 at 12:09:07PM -0700, Chuck Swiger wrote:
>>  > > On May 12, 2007, at 12:54 PM, Gary Kline wrote:
>>  > > >This is for those of us who appreciate ASCII or straight
>>  > > >	ISO_8859-15 rather than marked up files.  I have slapped together
>>  > > >	a crude C program that does scotch (or *cleanse*) text of
>>  > > >	<B></B> and so on.   Still... is there some standalone converter
>>  > > >	that gets rid of markup more elegantly?   Something where I
>>  > > >	can say
>>  > > >
>>  > > >	% cmd file_1.html ... file_N.html and output file_1.text ...
>>  > > >	file_N.text?
>>  > > 
>>  > > Perhaps:
>>  > > 
>>  > >   lynx -dump file1.html ... > file.text
>>  > > 
>>  > > ...?
>>  > 
>>  > 	Hm, maybe I need Bill Campbell's -force_html switch.  
>>  > 
>>  > 	Yes, seems that way.  Using just -dump got most of them, but
>>  > 	using the -force_html caught all.  Need to script something to
>>  > 	reformat, but the worst of it's done!
>>
>> Also, if using Mozilla (so, I would assume, Firefox) the 'Save Page As'
>> dialog offers a picklist for 'Files of Type' that includes 'Text Files'.
>>
>> This does a pretty decent job of producing text from HTML files, and is
>> quicker than firing up lynx (or links) if you're already viewing a page.
> 
> 
> 	Oh sure; I've been saving html in text, ascii/8859-1 for years.
> 	But what I've got, and there are more saved **somewhere**, are
> 	files that are saved by default in markup.  I have a slew of
> 	these on different boxen and have been moving them to one place.  
> 	Problem is: how to de-html the bunch.  
> 
> 	I'm too lazy to write something that would automate what can be
> 	automated--markup like "&foo;" is problematic.  So probably the 
> 	easiest way would be to create a dehtml.sh script that is just a 
> 	wrapper around lynx.  
> 
> 	I don't think I'm the only hacker who wants just-plain-ascii, so
> 	this might make a good project for somebody who's new to C or
> 	perl.   That's my two pennies' worth!
> 
> 	gary
> 
>> Cheers, Ian
>>
> 

If you don't want formatting and the number of tags is trivial, the 
solution is fairly simple in Perl (less than 150 lines, if even that).
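
To give a rough idea of the shape of such a tool, here is a minimal
sketch. It is written in Python rather than Perl and uses only the
standard library, so treat it as an illustration of the approach, not as
the Perl solution itself. The names dehtml and TextExtractor are made up
for this example, the list of block-level tags is a guess at what
matters for plain-text output, and reading and writing ISO-8859-15 is an
assumption based on the encoding mentioned earlier in the thread.

```python
import re
import sys
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect text, drop tags, and skip the bodies of script/style."""

    SKIP = {"script", "style"}
    # Assumed set of tags that should produce a line break in the output.
    BLOCK = {"p", "br", "div", "li", "tr",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; for us.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def dehtml(markup):
    """Return the text content of an HTML string."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of blanks
    text = re.sub(r" ?\n ?", "\n", text)    # trim blanks around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most one empty line
    return text.strip()


def main(paths):
    # file_1.html -> file_1.text, one output file per input file.
    for name in paths:
        src = Path(name)
        markup = src.read_text(encoding="iso-8859-15", errors="replace")
        out = src.with_suffix(".text")
        out.write_text(dehtml(markup) + "\n",
                       encoding="iso-8859-15", errors="replace")


if __name__ == "__main__":
    main(sys.argv[1:])
```

Saved as, say, dehtml.py (a hypothetical file name), it would be run as
"python dehtml.py file_1.html ... file_N.html". It makes no attempt at
tables, links or list bullets; lynx -dump remains the better choice when
layout matters.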

-Garrett



