From: Charles Swiger
Date: Fri, 4 Nov 2005 14:39:21 -0500
To: Kirk Strauser
Cc: freebsd-questions@freebsd.org
Subject: Re: Fast diff command for large files?

On Nov 4, 2005, at 12:29 PM, Kirk Strauser wrote:
>> Multigigabyte?  Find another approach to solving the problem, a text-base
>> diff is going to require excessive resources and time.  A 64-bit platform
>> with 2 GB of RAM & 3GB of swap requires ~1000 seconds to diff ~400MB.
>
> There really aren't many options.  For the patient, here's what's happening:
> [ ... ]
> And that's why I need a fast diff.  Even if it takes as long as the database
> bulk loads, we can run it on another server and use 20 seconds of CPU for
> PostgreSQL instead of 45 minutes.  The practical upshot is that the
> database will never get sluggish, even if the other "diff server" is loaded
> to the gills.

OK, but even if only one line out of 1000 changes, you still can't make
either diff or Colin Percival's bsdiff run on gigabyte-sized files and have
it fit into MAXDSIZE in a 32-bit address space.  From the latter's website:

"bsdiff is quite memory-hungry. It requires max(17*n,9*n+m)+O(1) bytes of
memory, where n is the size of the old file and m is the size of the new
file. bspatch requires n+m+O(1) bytes.

bsdiff runs in O((n+m) log n) time; on a 200MHz Pentium Pro, building a
binary patch for a 4MB file takes about 90 seconds. bspatch runs in O(n+m)
time; on the same machine, applying that patch takes about two seconds."
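
To put rough numbers on the formula quoted above, here is a small Python sketch. It is only arithmetic on the published expression, not code taken from bsdiff; the function names are made up for illustration, and the O(1) term is ignored.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def bsdiff_memory_bytes(old_size: int, new_size: int) -> int:
    """Approximate bsdiff working memory: max(17*n, 9*n + m), O(1) term ignored."""
    return max(17 * old_size, 9 * old_size + new_size)


def bspatch_memory_bytes(old_size: int, new_size: int) -> int:
    """Approximate bspatch working memory: n + m, O(1) term ignored."""
    return old_size + new_size


# Hypothetical case: old and new files are both 1 GiB.
n = m = 1 * GIB
print(bsdiff_memory_bytes(n, m) / GIB)   # 17.0 (GiB needed to build the patch)
print(bspatch_memory_bytes(n, m) / GIB)  # 2.0 (GiB needed to apply it)
```

Under those assumptions, a pair of 1 GiB files would need on the order of 17 GiB to build a patch, far beyond what a 32-bit process can address.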

Some time ago, I wrote a quick test harness for diff here:

   http://www.pkix.net/~chuck/difftest.py

On a FreeBSD 5.4 machine with kern.dfldsiz="1G" set in /boot/loader.conf,
you can only manage to run diff on files up to about 120 MB in size:

31-pi% ./difftest.py -v
INFO: beginning diff trial run with ratio = 100
filea_size=10485760 (aka 10.000 MB)
time=1.370      filea_size=10MB   diff_size=818KB
filea_size=15728640 (aka 15.000 MB)
time=2.305      filea_size=15MB   diff_size=1229KB
filea_size=23592960 (aka 22.500 MB)
time=5.443      filea_size=22MB   diff_size=1844KB
filea_size=35389440 (aka 33.750 MB)
time=7.195      filea_size=33MB   diff_size=2768KB
filea_size=53084160 (aka 50.625 MB)
time=16.771     filea_size=50MB   diff_size=4163KB
filea_size=79626240 (aka 75.938 MB)
time=43.525     filea_size=75MB   diff_size=6257KB
filea_size=119439360 (aka 113.906 MB)
time=78.346     filea_size=113MB  diff_size=9MB
filea_size=179159040 (aka 170.859 MB)
diff: memory exhausted
NOTICE: diff exitted with errno 2
time=36.896     filea_size=170MB  diff_size=0KB
272.58s real 154.73s user 13.23s system 61%

On the 64-bit SPARC box mentioned above, you can get sizes up to ~400 MB:

[ ... ]
filea_size=119439360 (aka 113.906 MB)
time=140.650    filea_size=115MB  diff_size=9MB
filea_size=179159040 (aka 170.859 MB)
time=424.586    filea_size=172MB  diff_size=15MB
filea_size=268738560 (aka 256.289 MB)
time=546.334    filea_size=258MB  diff_size=22MB
filea_size=403107840 (aka 384.434 MB)
time=957.059    filea_size=388MB  diff_size=33MB
filea_size=604661760 (aka 576.650 MB)
diff: memory exhausted
NOTICE: diff exitted with errno 2
time=105.728    filea_size=582MB  diff_size=0KB
5610.90s real 3268.63s user 1761.90s system 89%

Roughly, you need about an order of magnitude more RAM or virtual memory
available than the size of the files you are trying to diff, even if the
files are very similar.

-- 
-Chuck