From owner-freebsd-questions@FreeBSD.ORG Wed Jan 14 02:57:36 2009
Date: Wed, 14 Jan 2009 03:57:32 +0100
From: cpghost
To: freebsd-questions@freebsd.org
Message-ID: <20090114025732.GA98196@phenom.cordula.ws>
In-Reply-To: <20090102164412.GA1258@phenom.cordula.ws>
References: <20090102164412.GA1258@phenom.cordula.ws>
User-Agent: Mutt/1.5.18 (2008-05-17)
Subject: Re: Foiling MITM attacks on source and ports trees

On Fri, Jan 02, 2009 at 05:44:12PM +0100, cpghost wrote:
> Any idea? Could this be implemented as a plugin to Subversion (since
> it must access previous revisions of files and previously computed
> digests)? Given read-only access to the repository, a set of simple
> Python scripts or C/C++ programs could easily implement the basic
> functionality and cache the results for fast retrieval by other
> scripts. But how well will all this scale?

Sorry to revive this thread by replying to myself, but nothing has
materialized out of it (yet).
Considering all that has been said up to now, it boils down to this:

Issue #1 was signing the list. With or without SSL/TLS certificates,
the (compressed) list could be signed by a web-of-trust-anchored GnuPG
Project Key, so let's assume it will be, and deal later (if at all)
with transmission over SSL and how to obtain a certificate for the
server(s).

Issue #2 was how to generate the list out of the repository. A script
with (read-only) access to the Subversion repo would first compute, in
batch mode, md5/sha256 checksums for *all* existing revisions. That
may take some time, but so what? It's a one-time job, so let it run
overnight to checksum the few GBs. The results could be stored in an
arbitrary database. Then another script would be hooked into
Subversion, so that each commit computes the md5/sha256 checksums of
the newly added revisions and stores them in the database as well.
That shouldn't burden the server much: even if commits arrive in
bursts, the bytes in each commit can be checksummed very quickly and
saved to the database (I think / hope). It doesn't look like an overly
expensive operation.

Issue #3 was how to generate the list on demand. That's a simple
database query script that selects a subset of files, revisions, and
checksums from the database, compresses the result, signs it with the
GnuPG Project Key, and returns it to the user. This scales well to
many concurrent client queries, because the database is independent of
the Subversion server and can run on separate hardware -- and even be
replicated if need be.

Issue #4 was how to get the checksums on the client side. A simple app
could connect to the "checksum server" (the one defined in Issue #3)
-- or one of its mirrors if need be -- and fetch a signed list for a
specific subrange (say, from now back to 24h in the past).
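The checksum database at the center of Issues #2 and #3 could start as
little more than a table keyed on (path, revision). Here's a minimal
Python/sqlite3 sketch of that idea -- the names (record_revision,
lookup) and the schema are mine, not anything that exists yet, and in
a real deployment the file bytes would of course come out of the
Subversion repo (e.g. via svnlook in a post-commit hook) rather than
being passed in directly:

```python
import hashlib
import sqlite3

# Hypothetical schema: one row per (path, revision) pair.
SCHEMA = """
CREATE TABLE IF NOT EXISTS checksums (
    path     TEXT NOT NULL,
    revision INTEGER NOT NULL,
    sha256   TEXT NOT NULL,
    PRIMARY KEY (path, revision)
)
"""

def open_db(dbfile=":memory:"):
    db = sqlite3.connect(dbfile)
    db.execute(SCHEMA)
    return db

def record_revision(db, path, revision, data):
    """Checksum one revision of one file and store it.

    In deployment this would be driven by the batch job (for old
    revisions) or by a post-commit hook (for new ones).
    """
    digest = hashlib.sha256(data).hexdigest()
    db.execute("INSERT OR REPLACE INTO checksums VALUES (?, ?, ?)",
               (path, revision, digest))
    db.commit()
    return digest

def lookup(db, path, revision):
    """The core of the Issue #3 query script: fetch one checksum."""
    row = db.execute(
        "SELECT sha256 FROM checksums WHERE path = ? AND revision = ?",
        (path, revision)).fetchone()
    return row[0] if row else None
```

The Issue #3 server would then just run a range query over this table,
compress the rows, and pipe them through gpg --sign before returning
them to the client.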
It would verify the signature using the public Project Key (obtained
through a secure channel -- but let's worry about that later, once the
infrastructure is in place). This app could factor the tasks of
querying the server and checking the signature out into a library that
could also be used by an extended version of csup. The idea is that
csup, called with a special flag, would verify the checksums of all
files downloaded in the current run, while the main app could still
check the integrity of a tree fetched 2 months ago, provided it is
called with the right time stamp.

Issue #5 was how to identify the revisions of files stored locally.
That's a tough one, AFAICS. How to solve it? Ideas? For old trees it's
kinda hopeless (but read below); new invocations of a modified csup
could save metadata, including revision numbers, somewhere
(/var/db/sup perhaps), and use that metadata later. For old(er) trees,
checksums could be computed locally and sent to the "checksum server"
for identification. The server would match the paths and checksums
obtained from the client and return a revision number (if any) out of
the database. That in turn could be stored after the fact in
/var/db/sup, and everything could proceed as above.

So... implementation should now be easy as pie: just a few lines of
Python, or a few more lines of C, and a couple of little programs --
and of course read-only access to the repository for deployment once
it's ready. Or is it not yet?

Thanks,
-cpghost.

-- 
Cordula's Web. http://www.cordula.ws/
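P.S. The Issue #5 matching step, on either side of the wire, could be
as simple as hashing the local file and searching the
(signature-verified) checksum list for that digest. A rough Python
sketch; identify_revision and the dict-shaped list are my own
assumptions for illustration, not an existing API:

```python
import hashlib

def identify_revision(signed_list, path, local_data):
    """Find which revision(s) a local file's contents match.

    signed_list maps (path, revision) -> sha256 hex digest, i.e. the
    already-verified list fetched from the hypothetical checksum
    server.  Returns a sorted list of candidate revision numbers,
    which the client could then cache (in /var/db/sup, say).
    """
    digest = hashlib.sha256(local_data).hexdigest()
    return sorted(rev for (p, rev), d in signed_list.items()
                  if p == path and d == digest)
```

If the returned list is empty, the local file matches no known
revision -- exactly the tampering (or local modification) signal we're
after.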