Date:      Mon, 7 Nov 2005 09:48:22 -0600
From:      Kirk Strauser <kirk@strauser.com>
To:        freebsd-questions@freebsd.org
Subject:   Re: Fast diff command for large files?
Message-ID:  <200511070948.27910.kirk@strauser.com>
In-Reply-To: <cb5206420511060539qe4d7c40i198e806950c60482@mail.gmail.com>
References:  <200511040956.19087.kirk@strauser.com> <200511060657.39674.kirk@strauser.com> <cb5206420511060539qe4d7c40i198e806950c60482@mail.gmail.com>

On Sunday 06 November 2005 07:39, Andrew P. wrote:

> Note, that the difference must be kept in RAM, so it won't work if there
> are multi-gig diffs, but it will work very fast if the diffs are only
> 10-100Mb, it will work at close to I/O speed if the diff is under 10Mb.

Thanks, Andrew!  My Python script runs that algorithm in 17 seconds on a
400MB file with 10% CPU.

For anyone interested, here's my implementation.  Note that the readline()
method in Python always returns something, even at EOF (at which point you
get an empty string).  Also, empty strings evaluate as "false", which is
why the "if not (oldline or newline): break" code exits at the end.

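    # oldfile and newfile are file objects already opened for reading.
    old_records = []
    new_records = []

    while True:
        oldline, newline = oldfile.readline(), newfile.readline()
        # readline() returns '' at EOF, so stop once both files are exhausted.
        if not (oldline or newline):
            break
        if oldline == newline:
            continue

        # If this old line is waiting in new_records, the two cancel out.
        try:
            new_records.remove(oldline)
        except ValueError:
            if oldline:
                old_records.append(oldline)

        # Likewise for the new line against the pending old records.
        try:
            old_records.remove(newline)
        except ValueError:
            if newline:
                new_records.append(newline)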
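
For anyone who wants to try the loop above without real files, here is a
minimal self-contained sketch.  The function name diff_records and the
io.StringIO sample data are my own stand-ins for illustration, not part of
the original script; the loop body is unchanged.

```python
import io


def diff_records(oldfile, newfile):
    """Return (old_only, new_only) line lists for two file-like objects.

    Hypothetical wrapper around the loop above; only unmatched lines are
    held in memory, so it suits files that are mostly identical.
    """
    old_records = []
    new_records = []

    while True:
        oldline, newline = oldfile.readline(), newfile.readline()
        # readline() returns '' at EOF; stop when both files are exhausted.
        if not (oldline or newline):
            break
        if oldline == newline:
            continue

        # Cancel the old line against a pending new record, else remember it.
        try:
            new_records.remove(oldline)
        except ValueError:
            if oldline:
                old_records.append(oldline)

        # Cancel the new line against a pending old record, else remember it.
        try:
            old_records.remove(newline)
        except ValueError:
            if newline:
                new_records.append(newline)

    return old_records, new_records


# Sample data (made up): line "b" was dropped and line "e" was added.
old = io.StringIO("a\nb\nc\nd\n")
new = io.StringIO("a\nc\nd\ne\n")
removed, added = diff_records(old, new)
```

With that sample input, removed holds only the "b" line and added holds
only the "e" line; the shifted "c" and "d" lines cancel each other out as
the loop advances.  Note that list.remove() is a linear scan, so this stays
fast only while the pending lists are small.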

> Hope this gives you some idea.

It did.  It must've been a long work week, because that all seems so obvious
in retrospect but was completely opaque at the time.  Thanks again!
-- 
Kirk Strauser



