Date:      Sat, 19 Jul 2014 06:40:14 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        hiren panchasara <hiren.panchasara@gmail.com>
Cc:        Adrian Chadd <adrian@freebsd.org>, "freebsd-net@freebsd.org" <net@freebsd.org>
Subject:   Re: UDP sendto() returning ENOBUFS - "No buffer space available"
Message-ID:  <20140719053318.I15959@besplex.bde.org>
In-Reply-To: <CALCpEUE-vebmaGSK5aGM+3q5YqzXkn1P=St7R8G_ztmHmgUBBA@mail.gmail.com>
References:  <CALCpEUE7OtbXjVTk2C8+V7fjOKutuNq04BTo0SN42YEgX81k-Q@mail.gmail.com> <CAJ-VmokEiZMpdfNjs+-C9pmRcjOOjjNGTvM88muh940sr7SmPw@mail.gmail.com> <CALCpEUE-vebmaGSK5aGM+3q5YqzXkn1P=St7R8G_ztmHmgUBBA@mail.gmail.com>

On Fri, 18 Jul 2014, hiren panchasara wrote:

> On Wed, Jul 16, 2014 at 11:00 AM, Adrian Chadd <adrian@freebsd.org> wrote:
>> Hi!
>>
>> So the UDP transmit path is udp_usrreqs->pru_send() == udp_send() ->
>> udp_output() -> ip_output()
>>
>> udp_output() does do a M_PREPEND() which can return ENOBUFS. ip_output
>> can also return ENOBUFS.
>>
>> it doesn't look like the socket code (eg sosend_dgram()) is doing any
>> buffering - it's just copying the frame and stuffing it up to the
>> driver. No queuing involved before the NIC.
>
> Right. Thanks for confirming.

Most buffering should be in ifq above the NIC.  For UDP, I think
udp_output() puts buffers on the ifq and calls the driver for every
one, but the driver shouldn't do anything for most calls.  The
driver can't possibly do anything if its ring buffer is full, and
shouldn't do anything if it is nearly full.  Buffers accumulate in
the ifq until the driver gets around to them or the queue fills up.
Most ENOBUFS errors occur when it fills up.  It can very easily
fill up, especially since it is too small in most configurations.
Just loop calling sendto().  This will fill the ifq almost
instantly unless the hardware is faster than the software.
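
Something like this untested sketch demonstrates it (the destination
address and port are placeholders, and the tiny payload is arbitrary):

#include <sys/socket.h>

#include <arpa/inet.h>
#include <netinet/in.h>

#include <errno.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	struct sockaddr_in sin;
	char buf[18];
	unsigned long npkt;
	int s;

	s = socket(AF_INET, SOCK_DGRAM, 0);
	if (s == -1) {
		perror("socket");
		return (1);
	}
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(9);			/* discard */
	sin.sin_addr.s_addr = inet_addr("192.168.0.2");	/* placeholder */
	memset(buf, 0, sizeof(buf));
	for (npkt = 0; ; npkt++) {
		if (sendto(s, buf, sizeof(buf), 0,
		    (struct sockaddr *)&sin, sizeof(sin)) == -1) {
			if (errno == ENOBUFS)
				break;
			perror("sendto");
			return (1);
		}
	}
	printf("ENOBUFS after %lu packets\n", npkt);
	return (0);
}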

>> So a _well behaved_ driver will return ENOBUFS _and_ not queue the
>> frame. However, it's entirely plausible that the driver isn't well
>> behaved - the intel drivers screwed up here and there with transmit
>> queue and failure to queue vs failure to transmit.

No, the driver doesn't have much control over the ifq.

>> So yeah, try tweaking the tx ring descriptor for the driver you're
>> using and see how big a bufring it's allocating.
>
> Yes, so I am dealing with Broadcom BCM5706/BCM5708 Gigabit Ethernet,
> i.e. bce(4).
>
> I bumped up tx_pages from 2 (default) to 8 where each page is 255
> buffer descriptors.
>
> I am seeing quite nice improvement on stable/10 where I can send
> *more* stuff :-)

255 is not many.  I am most familiar with bge where there is a single
tx ring with 511 or 512 buffer descriptors (some bge's have more, but
this is unportable and was not supported last time I looked; the
extras might be only for input).  One of my bge's can do 640 kpps with
tiny packets (only 80 kpps with normal packets) and the other only 200
(?) kpps (both should be limited mainly by the PCI bus, but the slow
one is limited by it being a dumbed-down 5705 "plus").  At 640 kpps,
it takes 800 microseconds to transmit 512 packets.  (There is 1 packet
per buffer descriptor for small packets.)
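(That is, 512 packets / 640000 packets per second = 0.8 ms = 800
microseconds.)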

Considerable buffering in ifq is needed to prevent the transmitter
running dry whenever the application stops generating packets for more
than 800 microseconds for some reason, but the default buffering is
stupidly small.  The default is given by net.link.ifqmaxlen and some
corresponding macros, and is still just 50.  50 was enough for 1 Mbps
ethernet and perhaps even for 10 Mbps, but is now too small.  Most
drivers don't use it, but use their own too-small value.  bge uses
just its own ring buffer size of 511.  I use 10000 or 40000 depending
on hz:

% diff -u2 if_bge.c~ if_bge.c
% --- if_bge.c~	2012-03-13 02:13:48.144002000 +0000
% +++ if_bge.c	2012-03-13 02:13:50.123023000 +0000
% @@ -3315,5 +3316,6 @@
%  	ifp->if_start = bge_start;
%  	ifp->if_init = bge_init;
% -	ifp->if_snd.ifq_drv_maxlen = BGE_TX_RING_CNT - 1;
% +	ifp->if_snd.ifq_drv_maxlen = BGE_TX_RING_CNT +
% +	    imax(4 * tick, 10000) / 1;
%  	IFQ_SET_MAXLEN(&ifp->if_snd, ifp->if_snd.ifq_drv_maxlen);
%  	IFQ_SET_READY(&ifp->if_snd);
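
For reference, the kernel's tick is 1000000 / hz microseconds and
BGE_TX_RING_CNT is 512, so the new expression evaluates to:

	hz = 100:  tick = 10000, so 512 + imax(40000, 10000) / 1 = 40512
	hz = 1000: tick = 1000,  so 512 + imax(4000, 10000) / 1  = 10512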

40000 is what is needed for 4 ticks' worth of buffering at hz = 100.
40000 is far too large where 50 is far too small, but something like
it is needed when hz is large due to another problem: select() on
the ENOBUFS condition is broken (unsupported), so when sendto()
returns ENOBUFS there is no way for the application to tell how
long it should wait before retrying.  If it wants to burn CPU then
it can spin calling sendto().  Otherwise, it should sleep, but
with a sleep granularity of 1 tick this requires several ticks' worth
of buffering to avoid the transmitter running dry.  Large queue lengths
give a large latency for packets at the end of the queue and give no
chance of the working set fitting in an Ln cache for small n.
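
A rough sketch of that sleep-and-retry workaround (not from any real
application; the sleep interval assumes hz = 100):

#include <sys/types.h>
#include <sys/socket.h>

#include <errno.h>
#include <unistd.h>

/*
 * Retry sendto() after ENOBUFS.  select() cannot wait for the buffer
 * space to come back, so sleep for about 1 tick and try again.
 */
static ssize_t
sendto_retry(int s, const void *buf, size_t len, int flags,
    const struct sockaddr *to, socklen_t tolen)
{
	ssize_t n;

	for (;;) {
		n = sendto(s, buf, len, flags, to, tolen);
		if (n != -1 || errno != ENOBUFS)
			return (n);
		usleep(10000);		/* ~1 tick at hz = 100 */
	}
}

The sleep length has to be tuned together with the ifq length so that
the queue doesn't drain completely while the application sleeps.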

The precise stupidly small value of (tx_ring_count - 1) for the ifq
length seems to be for no good reason.  Subtracting 1 is apparently
to increase the chance that all packets in the ifq can be fitted into
the tx ring.  But this is silly since the ifq packet count is in
different units to the buffer descriptor count.  For normal-size
packets, there are 2 descriptors per packet.  So in the usual case
where the ifq is full, only about half of it can be moved to the tx
ring.  And this is good since it gives a little more buffering.
Otherwise, the effective buffering is just what is in the tx ring,
since none is left in the ifq after transferring everything.
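
(Worked example with bge's numbers: a full ifq of 511 packets needs
about 511 * 2 = 1022 descriptors for normal-size packets, but the tx
ring only has 512, so only about 512 / 2 = 256 of those packets fit
in the ring at once and the rest stay queued in the ifq.)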

(tx_ring_count - 1) is used by many other drivers.  E.g., fxp.  fxp
is only 100 Mbps and its tx_ring_count is 128.  128 is a bit larger
than 50 but not enough.  Scaling down my 40000 gives 4000 for hz = 100
and 400 for hz = 1000.  I never worried about this problem at 100 Mbps.

Changing from 2 rings of length 255 to 8 of length 255 shouldn't make
much difference if other things are configured correctly.  It doesn't
matter much if the buffering is in ifq or in ring buffers.  Multiple
ring buffers, filled in advance of the active one running dry so that
the next one can be switched to quickly, mainly give you a tiny latency
optimization.  I get similar optimizations more in software for bge,
by doing watermark stuff.  The boundary between the ifq and the tx
ring also acts as a primitive watermark.  With watermarks, it is best
to not divide up the buffer evenly, but that is what the
(tx_ring_count - 1) sizing for the ifq sort of does.

Bruce


