Date: Fri, 4 Nov 2005 23:04:22 +0300
From: "Andrew P."
To: Kirk Strauser
Cc: freebsd-questions@freebsd.org
In-Reply-To: <200511041129.17912.kirk@strauser.com>
References: <200511040956.19087.kirk@strauser.com> <436B8ADF.4000703@mac.com> <200511041129.17912.kirk@strauser.com>
Subject: Re: Fast diff command for large files?

On 11/4/05, Kirk Strauser wrote:
> On Friday 04 November 2005 10:22, Chuck Swiger wrote:
>
> > Multigigabyte? Find another approach to solving the problem, a text-based
> > diff is going to require excessive resources and time. A 64-bit platform
> > with 2 GB of RAM & 3GB of swap requires ~1000 seconds to diff ~400MB.
>
> There really aren't many options. For the patient, here's what's happening:
>
> Our legacy application runs on FoxPro. Our web application runs on a
> PostgreSQL database that's a mirror of the FoxPro tables.
>
> We do the mirroring by running a program that dumps the FoxPro tables out as
> tab-delimited files. Thus far, we'd been using PostgreSQL's "copy from"
> command to read those files into the database. In reality, though, a very,
> very small percentage of rows in those tables actually change. So, I wrote
> a program that takes the output of diff and converts it into a series of
> "delete" and "insert" commands; benchmarking shows that this is roughly 300
> times faster in our use.
>
> And that's why I need a fast diff. Even if it takes as long as the database
> bulk loads, we can run it on another server and use 20 seconds of CPU for
> PostgreSQL instead of 45 minutes. The practical upshot is that the
> database will never get sluggish, even if the other "diff server" is loaded
> to the gills.
> --
> Kirk Strauser

Does the overall order of lines change every time you dump the
tables? If it does, is there any inexpensive way to sort them (not
necessarily alphabetically, just so that the order stays the same
from one dump to the next)?

If the order is stable, or can be made so, then there's a trivial
solution (a few lines of Perl, or a hundred lines of C) that will
run at roughly the speed of the I/O itself.
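
For illustration only, here is one way such a single-pass comparison might look. It is a sketch in Python rather than Perl or C, and it rests on an assumption stronger than the one stated above: both dumps must already be sorted in the same byte order (for example with sort(1) under LC_ALL=C), so that two lines can be compared to decide which file to advance. The function name sorted_diff and the "-"/"+" markers are made up for this example; they are not taken from any existing tool.

```python
def sorted_diff(old_lines, new_lines):
    """Yield ("-", line) for lines only in old, ("+", line) for lines only in new.

    ASSUMPTION: both inputs yield their lines in the same ascending order.
    Each input is read once, front to back, and only one line per input
    is held in memory at a time.
    """
    old_iter, new_iter = iter(old_lines), iter(new_lines)
    old = next(old_iter, None)
    new = next(new_iter, None)

    while old is not None and new is not None:
        if old == new:
            # Unchanged row: skip it in both dumps.
            old = next(old_iter, None)
            new = next(new_iter, None)
        elif old < new:
            # Row exists only in the old dump: it was deleted or changed.
            yield ("-", old)
            old = next(old_iter, None)
        else:
            # Row exists only in the new dump: it was inserted or changed.
            yield ("+", new)
            new = next(new_iter, None)

    # Whatever is left over in either dump has no counterpart.
    while old is not None:
        yield ("-", old)
        old = next(old_iter, None)
    while new is not None:
        yield ("+", new)
        new = next(new_iter, None)


if __name__ == "__main__":
    # Tiny in-memory demonstration with tab-delimited rows.
    old_dump = ["1\talice\n", "2\tbob\n", "3\tcarol\n"]
    new_dump = ["1\talice\n", "2\tbobby\n", "4\tdave\n"]
    for marker, line in sorted_diff(old_dump, new_dump):
        print(marker, line, end="")
```

With real dumps, the two arguments could be file objects opened in binary mode, so that Python compares bytes the same way a C-locale sort would order them. A changed row shows up as one "-" and one "+" entry, which maps naturally onto the "delete" and "insert" commands described in the quoted message.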