Date:      Thu, 25 Sep 2008 16:01:28 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        gnn@FreeBSD.org
Cc:        net@FreeBSD.org
Subject:   Re: Proposed patch, convert IFQ_MAXLEN to kernel tunable...
Message-ID:  <20080925150227.L33284@delplex.bde.org>
In-Reply-To: <m2k5d15hke.wl%gnn@neville-neil.com>
References:  <m2skrq7jb1.wl%gnn@neville-neil.com> <20080924195331.GQ783@funkthat.com> <m2k5d15hke.wl%gnn@neville-neil.com>

On Wed, 24 Sep 2008 gnn@FreeBSD.org wrote:

> At Wed, 24 Sep 2008 12:53:31 -0700,
> John-Mark Gurney wrote:
>>
>> George V. Neville-Neil wrote this message on Tue, Sep 23, 2008 at 15:29 -0400:
>>> It turns out that the last time anyone looked at this constant was
>>> before 1994 and it's very likely time to turn it into a kernel
>>> tunable.  On hosts that have a high rate of packet transmission
>>> packets can be dropped at the interface queue because this value is
>>> too small.  Rather than make a sweeping code change I propose the
>>> following change to the macro and updating a couple of places in the
>>> IP and IPv6 stacks that were using this macro to set their own global
>>> variables.

The global value is rarely used (except by low-end and/or 10 Mbps hardware
where it doesn't matter).  One important place where it is used (that you
changed) is for ipintrq, but this should have a different default anyway
and already has its own sysctl to fix up the broken default.
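
(For concreteness, a rough sketch of that kind of change -- not the
actual patch, and the tunable name is only an assumption -- would be to
turn the constant into a global with a loader tunable:

    int ifqmaxlen = IFQ_MAXLEN;    /* old constant becomes the default */
    TUNABLE_INT("net.link.ifqmaxlen", &ifqmaxlen);  /* loader.conf knob */

The existing run-time knob for ipintrq is net.inet.ip.intr_queue_maxlen.)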

>> The better solution is to resurrect rwatson's patch that eliminates the
>> interface queue, and does direct dispatch to the ethernet driver..

If this were really better, then rwatson would have committed it.

>> Usually the driver has a queue of 512 or more packets already, so putting
>> them into a second queue doesn't provide much benefit besides increasing
>> the amount of locking necessary to deliver packets...

No, the second queue provides the fairly large benefit of doing watermark
processing via double buffering.  Most or all network drivers do no
watermark processing except accidentally via the second queue.  Even
with watermarks in the driver queue, it still takes a fairly large
external queue to prevent the transmitter running dry under load.
E.g., in my version of the bge driver, there is essentially a watermark
at 112 descriptors before the end of the driver tx queue.  Under load,
the driver tx queue is normally filled with 496 descriptors on every
output interrupt.  The driver enters state IFF_DRV_OACTIVE and upper
layers stop trying to transfer packets to the driver queue.  The next
output interrupt occurs after 384 of the 496 descriptors have been
handled, so that 112 remain.  Then IFF_DRV_OACTIVE is cleared and upper
layers resume transferring packets to the driver queue.  For output
to stream, and for efficiency, it is essential for upper layers to
have plenty of packets to transfer at this point, but without the
double buffering they would have none.  When things are working right,
the upper layers transfer enough packets at this point to produce more
than 384 descriptors, so that the driver queue fills again.  For output
to stream,
it is essential to produce 1 new descriptor faster than the hardware
can handle the 112 remaining ones and then keep producing new descriptors
faster than the hardware can handle the previous new ones.  With my
hardware, the 112 descriptors take about 150 usec to handle.  Buffering
in userland is too far away to deliver new packets within this time in
many cases (though syscalls are fast enough, scheduling delays are
usually too long).
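
In outline, the interaction looks roughly like this (a sketch only: the
XXX_* constants and xxx_* helpers are invented and don't match any real
driver verbatim; the usual <net/if.h>/<net/if_var.h> kernel environment
is assumed):

    #define XXX_TX_RING_CNT 512
    #define XXX_TX_HIWAT    112     /* descriptors left when txeof runs */

    static void
    xxx_start_locked(struct ifnet *ifp)
    {
        struct xxx_softc *sc = ifp->if_softc;
        struct mbuf *m;

        if (ifp->if_drv_flags & IFF_DRV_OACTIVE)
            return;         /* ring full; if_snd keeps buffering */
        while (sc->xxx_tx_free > 0) {
            IFQ_DRV_DEQUEUE(&ifp->if_snd, m);   /* software queue */
            if (m == NULL)
                break;
            xxx_encap(sc, m);       /* consumes sc->xxx_tx_free */
        }
        if (sc->xxx_tx_free == 0)
            ifp->if_drv_flags |= IFF_DRV_OACTIVE;
    }

    static void
    xxx_txeof(struct xxx_softc *sc) /* from the tx interrupt */
    {
        struct ifnet *ifp = sc->xxx_ifp;

        xxx_reclaim(sc);            /* grows sc->xxx_tx_free */
        if (sc->xxx_tx_free >= XXX_TX_RING_CNT - XXX_TX_HIWAT) {
            /*
             * Watermark reached: ~112 descriptors (~150 usec on my
             * hardware) are still queued.  Clear OACTIVE and refill
             * the ring from if_snd before they drain, or output
             * stops streaming.
             */
            ifp->if_drv_flags &= ~IFF_DRV_OACTIVE;
            if (!IFQ_DRV_IS_EMPTY(&ifp->if_snd))
                xxx_start_locked(ifp);
        }
    }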

> Actually I am making this change because I found on 10G hardware the
> queue is too small.  Also, there are many systems where you might want
> to up this, usually ones that are highly biased towards transmit only,
> like a multicast repeater of some sort.

You would rarely want the global default to apply to all interfaces.

Anyway, interfaces with a hardware queue length (er, ring count) of
512 normally don't use the default, but use their ring count (bogusly
less 1) for the software queue length too.  This can be too small too.
In some cases, since select() for writing on sockets attached to
hardware doesn't work (it should block if all queues are full and
return soon after IFF_DRV_OACTIVE transitions from set to clear and
the queues become non-full, but it only understands the socket buffer
and doesn't look at the queues), context switches to the application
producing the packets are delayed until at least the next scheduling
tick or two, which happens 2-20 msec later.  I use sendq lengths of
~20000 for bge and em to prevent the queues from running dry with a
scheduling latency of 20 msec.  (10 Gbps would require preposterous
queue lengths of ~200000 for the same result.)  However, even a queue
length of (512+512-1) is
bad for performance.  Cycling through mbufs in a circular way ensures
busting of caches once the queue length is large enough for the memory
to exceed the cache size[s].  With a queue length of 50 and today's
cache sizes, it is almost possible to fit everything into the L1 cache.
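
The (ring count - 1) software queue length comes from the usual
attach-time pattern, roughly (illustrative only; XXX_TX_RING_CNT again
stands in for the driver's ring size macro, e.g. BGE_TX_RING_CNT, and
exact code varies per driver):

    IFQ_SET_MAXLEN(&ifp->if_snd, XXX_TX_RING_CNT - 1);
    ifp->if_snd.ifq_drv_maxlen = XXX_TX_RING_CNT - 1;
    IFQ_SET_READY(&ifp->if_snd);

Getting a sendq of ~20000 amounts to using a much larger value there
(or a larger default) instead of the ring count.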

Bruce


