Date:      Mon, 04 Feb 2008 14:52:35 +0100
From:      Dag-Erling Smørgrav <des@des.no>
To:        Jeremy Chadwick <koitsu@FreeBSD.org>
Cc:        hackers@freebsd.org, Ed Schouten <ed@fxq.nl>
Subject:   Re: sort(1) memory usage
Message-ID:  <86d4rcg76k.fsf@ds4.des.no>
In-Reply-To: <20080204125831.GA4052@eos.sc1.parodius.com> (Jeremy Chadwick's message of "Mon\, 4 Feb 2008 04\:58\:31 -0800")
References:  <8663x6mc2o.fsf@ds4.des.no> <20080203131322.GK1179@hoeg.nl> <20080203151550.GA67020@owl.midgard.homeip.net> <86prvekqs2.fsf@ds4.des.no> <86lk62kqeh.fsf@ds4.des.no> <20080204125831.GA4052@eos.sc1.parodius.com>

Jeremy Chadwick <koitsu@FreeBSD.org> writes:
> As you said: the code shows that when no files are specified (e.g.  read
> off a pipe), sort will make some assumptions regarding the initial
> buffer size to read data into.  The buffer size allocated in that case
> is fairly large, rather than basing it off of the first line off stdin;
> it looks like this is done to save CPU time in the long run (otherwise
> you'd have to reallocate more later and take a hit; initbuf() is
> responsible for that).

Oh give me a break, the self-starting exponential algorithm for growth
of dynamically allocated buffers has been known for decades.  In case
any GNU sort developers are reading this, here it comes, free of charge:

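#include <stdlib.h>
#include <string.h>

static char *buf = NULL;        /* the buffer itself */
static size_t bufsz = 0;        /* bytes allocated */
static size_t buflen = 0;       /* bytes used, not counting the NUL */

/* Append str to buf, growing it as needed; returns 0, or -1 on failure. */
int
buf_append(const char *str)
{
        size_t len;

        len = strlen(str);
        if (buflen + len + 1 > bufsz) {
                size_t nbufsz = bufsz;
                char *nbuf;

                /* roughly double until it fits; the + 1 gets us off zero */
                while (buflen + len + 1 > nbufsz)
                        nbufsz = nbufsz * 2 + 1;
                nbuf = realloc(buf, nbufsz);
                if (nbuf == NULL)
                        return (-1);
                buf = nbuf;
                bufsz = nbufsz;
        }
        memcpy(buf + buflen, str, len);
        buf[buflen + len] = '\0';
        buflen += len;
        return (0);
}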

With a good allocator - and depending to some extent on the memory usage
pattern of the rest of your program - if you jump-start it by initially
allocating 16 kB or so (and setting bufsz accordingly), realloc() will
never need to copy anything - but even in the worst case, the amortized
cost will be O(2n), IIRC.  This is practically unnoticeable next to the
cost of the sorting algorithm itself, which will be O(n log n) at best
and O(n*n) at worst.
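
To make the amortized bound concrete, here is a minimal sketch that
counts bytes copied under the pessimistic assumption that every
realloc() moves the whole buffer.  copied_bytes() is a hypothetical
helper written for this illustration; it models one-byte appends and
ignores the extra byte the code above reserves for the NUL.

```c
#include <stddef.h>

/*
 * Worst-case model: assume every growth step has to move the buffer,
 * copying all bytes stored so far.  Returns the total number of bytes
 * copied while appending n bytes one at a time, using the same
 * nbufsz * 2 + 1 growth rule as buf_append().
 */
size_t
copied_bytes(size_t n)
{
        size_t bufsz = 0, buflen = 0, copied = 0;

        while (buflen < n) {
                if (buflen + 1 > bufsz) {
                        copied += buflen;       /* pessimistic: full copy */
                        bufsz = bufsz * 2 + 1;
                }
                buflen++;
        }
        return (copied);
}
```

In this model copied_bytes(1000) is 1013, and the total stays below
2n for any n, since the growth points are 0, 1, 3, 7, 15 and so on.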

> > Looking at the code, it seems to go to extreme lengths to get it
> > absolutely wrong.  For instance, if hw.physmem / 8 > hw.usermem, it will
> > pick the former, which means it's pretty much guaranteed to either fail
> > or hose your system (or both).
> Can you expand on this?  Looking at the code, it doesn't appear that's
> possible.  The code in question is default_sort_size(), which is used
> when no -S or --buffer-size argument is specified.

I looked at how it computes the cap, which is MAX(total / 8, avail) - in
other words, never mind what's actually feasible, I want more!  More!
More, I say!
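
As a rough sketch of the difference this makes: cap_as_described()
mirrors the MAX(total / 8, avail) expression quoted above, while
cap_conservative() is an assumed alternative using MIN, shown only
for contrast.  Both function names are made up for this example, and
neither is taken from the GNU sort source.

```c
#include <stddef.h>

/* The cap as described above: the larger of total / 8 and avail. */
size_t
cap_as_described(size_t total, size_t avail)
{
        size_t eighth = total / 8;

        return (eighth > avail ? eighth : avail);
}

/* Assumed alternative for contrast: never exceed what is available. */
size_t
cap_conservative(size_t total, size_t avail)
{
        size_t eighth = total / 8;

        return (eighth < avail ? eighth : avail);
}
```

With total = 8192 and avail = 512 (say, in MB), the first returns 1024,
twice what is available, while the second returns 512.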

> > Count this as a vote for ditching GNU sort in favor of a BSD-licensed
> > implementation (from {Net,Open}BSD for instance).
> In this specific case, I think you're bashing GNU just because you feel
> like it.  Come on man... =/

No, I'm bashing GNU because it's bloated crap, as this example clearly
shows.  It wouldn't be the first time a BSD rewrite of a GNU tool ended
up performing better; see for instance bsdtar.  Besides, the FreeBSD
project has a tradition of replacing GNU tools with BSD-licensed
equivalents as long as no functionality is lost.

DES
-- 
Dag-Erling Smørgrav - des@des.no


