Date:      Wed, 11 Oct 2006 12:27:20 +0200
From:      Kyrre Nygård <kyrreny@broadpark.no>
To:        questions@freebsd.org
Subject:   Script to fetch Wikipedia text
Message-ID:  <7.0.1.0.2.20061011122630.03a86968@broadpark.no>


	Hey!

	I'm involved in a few research projects, and like to keep my 
information well organized. I usually get most of it from Wikipedia; 
however, I hate printing HTML articles to PDF. I'd rather have them 
in pure, well laid out text. And I'm sure others would too. Being 
able to master one's knowledge provides a warm inner peace.

	Hence I've tried dumping the output from text browsers such as w3m, 
elinks, lynx etc. I am, however, only interested in the articles 
themselves, not their links, views, toolboxes, search bars, other 
available languages and so on. I tried running a whole bunch of 
regular expressions over the output, but that really felt like the hard way.

	So some guy gave me this:

#!/usr/bin/env ruby

require 'rexml/document'
require 'cgi'
require 'tempfile'
require 'open-uri'

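# Build the Special:Export URL from the article title given on the command line.
# Spaces become underscores; ':', '/' and '#' are restored after escaping.
title = ARGV.join(' ').strip.squeeze(' ').tr(' ', '_')
url = 'http://en.wikipedia.org/wiki/Special:Export/' +
      CGI::escape(title).gsub(/%3[Aa]/, ':').gsub(/%2[Ff]/, '/').gsub(/%23/, '#')

# open-uri returns a StringIO or a Tempfile depending on the response size.
# On Ruby 3.0 and later, call URI.open(url) instead of open(url).
open(url) { |f|
   doc = REXML::Document::new(f.class == Tempfile ? f.open : f)
   # The article's wikitext is in the <text> element of the export XML.
   puts REXML::XPath.first(doc, '//text').text
}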

	This seems to take advantage of Wikipedia's Special:Export feature, 
which is really cool. However, there are a few issues. First, the 
script looks kinda complex; I'm sure there's a simpler way of writing 
it. Second, it does not yet output the kind of pure, well laid out 
text that it should. For instance, for 
http://en.wikipedia.org/wiki/GNU_Hurd, it outputs:

########## BEGIN

{{Infobox_Software
| name = GNU Hurd
| logo = [[Image:Hurd-logo.png]]<br />
| developer = [[Thomas Bushnell| Michael (now Thomas) Bushnell]] 
(original developer) and various contributors
| latest_release_version =
| latest_release_date =
| operating_system = [[GNU]]
| genre = [[Kernel (computer science)|Kernel]]
| family = [[POSIX]]-conformant [[Unix]]-Clones
| kernel_type = [[Microkernel]]
| license = [[GNU General Public License|GPL]]
| source_model = [[Free software]]
| working_state = In production / development
| website = [http://www.gnu.org/software/hurd/hurd.html www.gnu.org]
}}
{{redirect|Hurd}}
'''The GNU Hurd''' is a computer operating system [[Kernel (computer 
science)|kernel]]. It consists of a set of [[Server 
(computing)|servers]] (or [[daemon (computer software)|daemons]], in 
[[Unix]]-speak) that work on top of either the [[GNU Mach]] 
[[microkernel]] or the [[L4 microkernel family|L4 microkernel]]; 
together, they form the [[kernel (computer science)|kernel]] of the 
[[GNU]] [[operating system]].  It has been under development since 
[[1990]] by the [[GNU]] Project and is distributed as [[free 
software]] under the [[GNU General Public License|GPL]].  The Hurd 
aims to surpass [[Unix]] kernels in functionality, security, and 
stability, while remaining largely compatible with them. This is done 
by having the Hurd track the [[POSIX]] specification, while avoiding 
arbitrary restrictions on the user.

"HURD" is an indirectly [[recursive acronym]], standing for "HIRD of 
[[Unix]]-Replacing [[Daemon (computer software)|Daemons]]", where 
"HIRD" stands for "HURD of Interfaces Representing Depth". It is also 
a play of words to give "[[herd]] of [[wildebeest|gnus]]" reflecting 
how it works.

==Development history==
Development on the GNU operating system began in 1984 and progressed 
rapidly. By the early 1990s, the only major component missing was the kernel.

Development on the Hurd began in [[1990]], after an abandoned kernel 
attempt started from the finished research [[Trix (kernel)|Trix]] 
operating system developed by Professor [[Steve Ward (Computer 
Scientist)| Steve Ward]] and his group at [[Massachusetts Institute 
of Technology| MIT]]'s [[Laboratory for Computer Science]] (LCS). 
According to [[Thomas Bushnell| Michael (now Thomas) Bushnell]], 
the initial Hurd architect, their early plan was 
to adapt the [[BSD]] 4.4-Lite kernel and, in hindsight, "It is now 
perfectly obvious to me that this would have succeeded splendidly and 
the world would be a very different place today".<ref>{{cite web | 
url = http://www.groklaw.net/article.php?story=20050727225542530 | 
title = The Hurd and BSDI|accessdate = 2006-08-08 | author = Peter H. 
Salus | work = The Daemon, the GNU and the Penguin}}</ref> However, 
due to a lack of cooperation from the [[University of California, 
Berkeley|Berkeley]] programmers, [[Richard Stallman]] decided instead 
to use the [[Mach microkernel]], which subsequently proved 
unexpectedly difficult, and the Hurd's development proceeded slowly.

########## END

This should instead be something like:

########## BEGIN

http://en.wikipedia.org/wiki/GNU_Hurd

Name = GNU Hurd
Developer = Thomas Bushnell (original developer) and various contributors
Operating system = GNU
Genre = Kernel (computer science)
Family = POSIX-conformant Unix-Clones
Kernel type = Microkernel
License = GNU General Public License
Source model = Free software
Working state = In production / development
Website = http://www.gnu.org/software/hurd/hurd.html
           http://www.gnu.org


The GNU Hurd is a computer operating system kernel. It consists of a set of 
servers (or daemons, in Unix-speak) that work on top of either the 
GNU Mach microkernel or the L4 microkernel; together, they form the 
kernel of the GNU operating system.  It has been under development 
since 1990 by the GNU Project and is distributed as free software 
under the GPL. The Hurd aims to surpass Unix kernels in 
functionality, security, and stability, while remaining largely 
compatible with them. This is done by having the Hurd track the POSIX 
specification, while avoiding arbitrary restrictions on the user.

``HURD'' is an indirectly recursive acronym, standing for ``HIRD of 
Unix-Replacing Daemons'', where ``HIRD'' stands for ``HURD of 
Interfaces Representing Depth''. It is also a play of words to give 
``herd of gnus'' reflecting how it works.

Development history

Development on the GNU operating system began in 1984 and progressed 
rapidly. By the early 1990s, the only major component missing was the kernel.

Development on the Hurd began in 1990, after an abandoned kernel 
attempt started from the finished research Trix operating system 
developed by Professor Steve Ward and his group at MIT's Laboratory 
for Computer Science (LCS). According to Michael (now Thomas) 
Bushnell, the initial Hurd architect, their early plan was to adapt 
the BSD 4.4-Lite kernel and, in hindsight, "It is now perfectly 
obvious to me that this would have succeeded splendidly and the world 
would be a very different place today". However, due to a lack of 
cooperation from the Berkeley programmers, Richard Stallman decided 
instead to use the Mach microkernel, which subsequently proved 
unexpectedly difficult, and the Hurd's development proceeded slowly.

########## END
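
For the "Name = GNU Hurd" style lines at the top of the desired output, one possible approach is a separate pass over the infobox template. The following is only a rough sketch under several assumptions: the helper name infobox_lines is made up for illustration, each field is assumed to sit on a single line of the form "| key = value", the template is assumed to close with "}}" at the start of a line, and link targets (not labels) are kept because that is what the desired output above appears to show.

```ruby
# Hypothetical helper (not part of the original script): turn the first
# {{Infobox...}} template in a wikitext string into "Key = value" lines.
def infobox_lines(wikitext)
  # Grab everything from "{{Infobox" up to a line that starts with "}}".
  box = wikitext[/\{\{Infobox.*?^\}\}/m]
  return [] unless box

  # Each field is assumed to be on one line: "| key = value".
  fields = box.scan(/^\|[ \t]*([^=|\n]+?)[ \t]*=[ \t]*(.*)$/)
  lines = fields.map do |key, value|
    value = value.gsub(/\[\[Image:[^\]]*\]\]/, '')                # drop image links
                 .gsub(/\[\[([^\]|]*)(?:\|[^\]]*)?\]\]/, '\1')    # [[Target|Label]] -> Target
                 .gsub(%r{\[(https?://\S+)[^\]]*\]}, '\1')        # [url label] -> url
                 .gsub(/<[^>]+>/, '')                             # drop tags such as <br />
                 .strip
    next if value.empty?                                          # skip empty fields
    "#{key.tr('_', ' ').capitalize} = #{value}"
  end
  lines.compact
end
```

Fields whose value spans several lines would be cut off at the first line break by this sketch, so real articles may need a more careful parser.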

	Looks really gorgeous, doesn't it? If only I were skilled enough to do 
this myself. Which brings me to my question: Is anybody out there 
willing to help me fix my script?
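
As a starting point for the article body, the wiki markup could be reduced to plain text with a handful of substitutions. This is a minimal sketch and not a full wikitext parser: the helper name strip_wikitext is hypothetical, templates and footnotes are simply dropped, and printing external links as "label (url)" is an arbitrary choice made for this example.

```ruby
# Hypothetical helper (not part of the original script): reduce a chunk of
# wikitext to plain text using regular expressions only.
def strip_wikitext(text)
  out = text.dup
  # Drop footnotes: <ref>...</ref> pairs, then self-closing <ref ... />.
  out.gsub!(%r{<ref[^>/]*>.*?</ref>}m, '')
  out.gsub!(%r{<ref[^>]*/>}, '')
  # Drop any remaining HTML-style tags such as <br />.
  out.gsub!(/<[^>]+>/, '')
  # Drop {{templates}}, innermost first, so nested templates disappear too.
  nil while out.gsub!(/\{\{[^{}]*\}\}/m, '')
  # [[Target|Label]] -> Label, [[Target]] -> Target.
  out.gsub!(/\[\[(?:[^\]|]*\|)?\s*([^\]]*)\]\]/, '\1')
  # [http://example.org label] -> label (http://example.org)
  out.gsub!(%r{\[(https?://\S+)\s+([^\]]+)\]}, '\2 (\1)')
  # Remove bold and italic quote runs ('' and ''').
  out.gsub!(/'{2,}/, '')
  # ==Heading== -> Heading
  out.gsub!(/^=+[ \t]*(.*?)[ \t]*=+[ \t]*$/, '\1')
  out
end

# Example: clean a short piece of wikitext and print the result.
puts strip_wikitext("'''The GNU Hurd''' is a [[Kernel (computer science)|kernel]].")
```

In the script, this would be applied to the string returned by REXML::XPath.first(...).text before printing. Image links with several "|" parts, tables, and literal "<" or ">" characters in the text are not handled and would need extra rules.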

Thanks a lot,
Kyrre



