Date:      Mon, 08 Mar 2010 02:24:17 +0100 (CET)
From:      Alexander Best <alexbestms@wwu.de>
To:        Giorgos Keramidas <keramida@ceid.upatras.gr>
Cc:        Dan Nelson <dnelson@allantgroup.com>, freebsd-questions@freebsd.org
Subject:   Re: mailing list archive as mbox
Message-ID:  <permail-2010030801241780e26a0b0000466b-a_best01@message-id.uni-muenster.de>
In-Reply-To: <87bpf01d5m.fsf@kobe.laptop>

Giorgos Keramidas wrote on 2010-03-07:
> On Sun, 07 Mar 2010 12:08:32 +0100 (CET), Alexander Best
> <alexbestms@wwu.de> wrote:
> > Dan Nelson wrote on 2010-03-07:
> >> In the last episode (Mar 07), Alexander Best said:
> >> > hi there,

> >> > what are the steps i need to perform to get a copy of the entire
> >> > mailing list archive of, let's say, freebsd-current@ in mbox format?

> >> Go to ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/
> >> where you can download weekly gzipped archives of all the mailing
> >> lists since their creation.

> > thanks for the hint, but it would take hours to download all those
> > gzipped files, extract them and merge them.

> > i really need ALL the messages of a mailing list. of course i could
> > use the gzipped files you mentioned if i had some script for
> > downloading, extracting and merging all those files for me.

> It's relatively easy to hack one.

wow!!! thanks a billion. that's a great script. i pointed the vars containing
the ftp sites at mirrors near me, which give me better download speed, and will
run the script for freebsd-current@ tonight (~850 archives to pull).

thanks again. great job. :-)

alex

> You can get a list of year names from the /archive/ directory itself
> with curl(1) and a small amount of Python plumbing around curl:

>     >>> from subprocess import Popen as popen, PIPE
>     >>> import re
>     >>> yre = re.compile(r'^d.*\s(\d+)$')
>     >>> devnull = open("/dev/null", "w")
>     >>> def years():
>     ...     curl = "curl -o /dev/stdout ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/"
>     ...     ylist = []
>     ...     for line in popen(curl, shell=True, stdout=PIPE, stderr=devnull).stdout.readlines():
>     ...         m = yre.match(line)
>     ...         if m:
>     ...             ylist.append(int(m.group(1)))
>     ...     return ylist
>     ...
>     >>> years()
>     [1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003,
>      2004, 2005, 2006, 2007, 2008, 2009, 2010]
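
The year-matching step can be tried without touching the FTP server. What follows is only a sketch in Python 3 syntax, assuming the server answers with a Unix `ls -l` style listing. The name `parse_years` and the sample lines are made up for illustration and are not real server output; the regex is the same idea as `yre`, with optional trailing whitespace allowed:

```python
import re

# A directory entry ("d...") whose last field consists only of digits.
YEAR_RE = re.compile(r'^d.*\s(\d+)\s*$')

def parse_years(listing):
    """Return the all-digit directory names found in an ls -l style listing."""
    years = []
    for line in listing.splitlines():
        m = YEAR_RE.match(line)
        if m:
            years.append(int(m.group(1)))
    return years

# Hypothetical sample listing (assumption, not captured from ftp.freebsd.org).
SAMPLE = (
    "drwxr-xr-x   36 1006  1006   1024 Jan 01  1995 1994\r\n"
    "drwxr-xr-x   36 1006  1006   1024 Jan 01  1996 1995\r\n"
    "-rw-r--r--    1 1006  1006    512 Mar 07  2010 README\r\n"
)

print(parse_years(SAMPLE))
```

Keeping the parsing in its own function makes it possible to check the regex against a saved listing before pointing the script at a mirror.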

> Then you can grab a list of the freebsd-current archives by looping
> through the list of years and looking for the list of files that
> match the pattern:

>     ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/{year}/freebsd-current/(\d+.freebsd-current.gz)

> Using a pipe to parse the output of curl you can collect a list of
> all the files that match this pattern, e.g.:

>     >>> def yearfiles(year):
>     ...     base = "ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/%4d/freebsd-current" % year
>     ...     curl = "curl -o /dev/stdout %s/" % base
>     ...     flist = []
>     ...     fre = re.compile(r'^.*\D(\d+\.freebsd-current\.gz).*$')
>     ...     for line in popen(curl, shell=True, stdout=PIPE, stderr=devnull).stdout.readlines():
>     ...         m = fre.match(line)
>     ...         if m:
>     ...             flist.append("%s/%s" % (base, m.group(1)))
>     ...     return flist
>     ...
>     >>> yearfiles(1994)
>     []
>     >>> yearfiles(1995)
>     ['ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz',
>      ...]
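
The file-name match can be checked offline in the same way. This is a rough Python 3 sketch, again assuming an `ls -l` style listing; `parse_archives` and the sample lines are hypothetical, while the base URL is the one quoted above. The dots in the pattern are escaped so that they only match a literal `.`:

```python
import re

# A listing line whose name is <digits>.freebsd-current.gz; \D keeps the
# greedy prefix from swallowing the leading digits of the date.
FILE_RE = re.compile(r'^.*\D(\d+\.freebsd-current\.gz).*$')

def parse_archives(listing, base):
    """Return full URLs for every weekly archive named in the listing."""
    urls = []
    for line in listing.splitlines():
        m = FILE_RE.match(line)
        if m:
            urls.append("%s/%s" % (base, m.group(1)))
    return urls

BASE = "ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current"

# Hypothetical sample listing (assumption, not real server output).
SAMPLE = (
    "-rw-r--r--  1 1006 1006 123456 Jan 08  1995 19950101.freebsd-current.gz\n"
    "-rw-r--r--  1 1006 1006 234567 Mar 05  1995 19950226.freebsd-current.gz\n"
    "-rw-r--r--  1 1006 1006    100 Mar 05  1995 index.html\n"
)

for url in parse_archives(SAMPLE, BASE):
    print(url)
```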

> Concatenating the file lists of all years and fetching each one of
> them with curl is then trivial:

>     >>> ylist = years()
>     >>> ylist
>     [1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003,
>      2004, 2005, 2006, 2007, 2008, 2009, 2010]
>     >>> flist = []
>     >>> for y in ylist:
>     ...     f = yearfiles(y)
>     ...     flist = flist + f
>     ...
>     >>> len(flist)
>     785

> Once you have the list of all the remote gzipped files, you can loop
> through the list of files once more and fetch them locally.  I'm only
> going to fetch the first two files here, but feel free to fetch all
> of them in your version of the script:

>     >>> flist = flist[:2]
>     >>> flist
>     ['ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz',
>      'ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950226.freebsd-current.gz']


>     >>> from subprocess import call
>     >>> import os
>     >>> def getfile(url):
>     ...     out = os.path.basename(url)
>     ...     retcode = call(["curl", "-o", out, url], stderr=devnull)
>     ...     if retcode == 0:
>     ...         print "fetched %s" % url
>     ...     return (url, out, retcode)
>     ...
>     >>> map(getfile, flist)
>     fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz
>     fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950226.freebsd-current.gz
>     ...
>     [('ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz',
>       '19950101.freebsd-current.gz', 0),
>      ('ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950226.freebsd-current.gz',
>       '19950226.freebsd-current.gz', 0)]
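
If curl is not installed, the download step could probably be done with the Python standard library alone. The following is a sketch in Python 3 syntax, not the method used above: it swaps curl for `urllib.request`, which handles `ftp://` URLs. The function name `fetch` and the `destdir` parameter are made up for illustration:

```python
import os
import shutil
import urllib.request
from urllib.parse import urlsplit

def fetch(url, destdir="."):
    """Download url into destdir; return (url, local path, error or None)."""
    out = os.path.join(destdir, os.path.basename(urlsplit(url).path))
    try:
        # urlopen is evaluated first, so no local file is created
        # when the remote side cannot be opened at all.
        with urllib.request.urlopen(url) as resp, open(out, "wb") as fh:
            shutil.copyfileobj(resp, fh)
    except OSError as exc:  # urllib.error.URLError is a subclass of OSError
        return (url, out, exc)
    return (url, out, None)

# Usage (not run here):
#   results = [fetch(u) for u in flist]
#   failed = [r for r in results if r[2] is not None]
```

A transfer that breaks halfway can leave a truncated file behind, so it may be worth deleting `out` for every entry that comes back with an error before retrying.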


> A slightly hackish script that collects all this to a more usable
> whole but lacks LOTS of error checking is the following:

>     #!/usr/bin/env python

>     from subprocess import call, Popen as popen, PIPE
>     import os
>     import re
>     import sys

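>     # curl's stderr (the progress meter) is discarded, so open for writing.
>     devnull = open("/dev/null", "w")
>     # Directory entries whose name is all digits, i.e. a year.
>     yre = re.compile(r'^d.*\s(\d+)$')
>     fre = re.compile(r'^.*\D(\d+\.freebsd-current\.gz).*$')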

>     def years():
>         # List the top-level archive directory and collect the year names.
>         curl = "curl -o /dev/stdout ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/"
>         ylist = []
>         for line in popen(curl, shell=True, stdout=PIPE, stderr=devnull).stdout.readlines():
>             m = yre.match(line)
>             if m:
>                 ylist.append(int(m.group(1)))
>         return ylist

>     def yearfiles(year):
>         # List one year's freebsd-current directory and build full URLs.
>         base = "ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/%4d/freebsd-current" % year
>         curl = "curl -o /dev/stdout %s/" % base
>         flist = []
>         for line in popen(curl, shell=True, stdout=PIPE, stderr=devnull).stdout.readlines():
>             m = fre.match(line)
>             if m:
>                 flist.append("%s/%s" % (base, m.group(1)))
>         return flist

>     def getfile(url):
>         out = os.path.basename(url)
>         retcode = call(["curl", "-o", out, url], stderr=devnull)
>         if retcode == 0:
>             print "fetched %s" % url
>         return tuple([url, out, retcode])

>     if __name__ == "__main__":
>         print "Fetching year list."
>         ylist = years()
>         if len(ylist) == 0:
>             print "No yearly archives found."
>             sys.exit(1)
>         print "Fetching file lists for %d years." % len(ylist)

>         flist = []
>         for y in ylist:
>             f = yearfiles(y)
>             flist = flist + f
>         if len(flist) == 0:
>             print "No archives found."
>             sys.exit(1)
>         print "Fetching %d archives." % len(flist)
>         fresult = map(getfile, flist)

>         fok = [fentry[1] for fentry in fresult if fentry[2] == 0]
>         ferr = [fentry[1] for fentry in fresult if fentry[2] != 0]
>         if len(fok) > 0:
>             print ""
>             print "Successfully downloaded %d archives" % len(fok)
>             for f in fok:
>                 print "    %s" % f
>         if len(ferr) > 0:
>             print ""
>             print "Failed to download %d archives" % len(ferr)
>             for f in ferr:
>                 print "    %s" % f

> Running this with a couple of lines to limit the FTP connections a
> bit and fetch only parts of the freebsd-current mail archives produces
> the following output on my laptop:

>     keramida@kobe:/tmp$ python foo.py
>     Fetching year list.
>     Fetching file lists for 3 years.
>     Fetching 5 archives.
>     fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz
>     fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950226.freebsd-current.gz
>     fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950305.freebsd-current.gz
>     fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950312.freebsd-current.gz
>     fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950319.freebsd-current.gz

>     Successfully downloaded 5 archives
>         19950101.freebsd-current.gz
>         19950226.freebsd-current.gz
>         19950305.freebsd-current.gz
>         19950312.freebsd-current.gz
>         19950319.freebsd-current.gz

> Without the limiting code that I removed from the example, it will
> try to fetch all the archive files for all 17 years.

> Then you can simply type:

>     gzip -cd *.freebsd-current.gz > freebsd-current.mbox

> to produce a single UNIX mbox file with all the messages.
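
The same merge can be done from Python where a shell with gzip is not at hand. This is a minimal Python 3 sketch of the idea rather than a replacement for the one-liner; `merge_archives` is a made-up name. It relies on the weekly files being named by date (YYYYMMDD), so a plain sort puts them in chronological order, which matches the order the shell glob would produce:

```python
import glob
import gzip
import shutil

def merge_archives(pattern, mbox_path):
    """Decompress every file matching pattern, in sorted order, into mbox_path."""
    names = sorted(glob.glob(pattern))
    with open(mbox_path, "wb") as mbox:
        for name in names:
            # Stream each archive so large files are never held in memory.
            with gzip.open(name, "rb") as gz:
                shutil.copyfileobj(gz, mbox)
    return names

# Usage (not run here):
#   merge_archives("*.freebsd-current.gz", "freebsd-current.mbox")
```

The output name must not match the glob pattern itself, otherwise a second run would try to read the mbox as a gzip file.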


