Date: Sat, 19 Jul 2014 06:40:14 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: hiren panchasara <hiren.panchasara@gmail.com>
Cc: Adrian Chadd <adrian@freebsd.org>, "freebsd-net@freebsd.org" <net@freebsd.org>
Subject: Re: UDP sendto() returning ENOBUFS - "No buffer space available"
Message-ID: <20140719053318.I15959@besplex.bde.org>
In-Reply-To: <CALCpEUE-vebmaGSK5aGM+3q5YqzXkn1P=St7R8G_ztmHmgUBBA@mail.gmail.com>
References: <CALCpEUE7OtbXjVTk2C8+V7fjOKutuNq04BTo0SN42YEgX81k-Q@mail.gmail.com>
 <CAJ-VmokEiZMpdfNjs+-C9pmRcjOOjjNGTvM88muh940sr7SmPw@mail.gmail.com>
 <CALCpEUE-vebmaGSK5aGM+3q5YqzXkn1P=St7R8G_ztmHmgUBBA@mail.gmail.com>
On Fri, 18 Jul 2014, hiren panchasara wrote:

> On Wed, Jul 16, 2014 at 11:00 AM, Adrian Chadd <adrian@freebsd.org> wrote:
>> Hi!
>>
>> So the UDP transmit path is udp_usrreqs->pru_send() == udp_send() ->
>> udp_output() -> ip_output()
>>
>> udp_output() does do a M_PREPEND() which can return ENOBUFS.  ip_output()
>> can also return ENOBUFS.
>>
>> it doesn't look like the socket code (eg sosend_dgram()) is doing any
>> buffering - it's just copying the frame and stuffing it up to the
>> driver.  No queuing involved before the NIC.
>
> Right. Thanks for confirming.

Most buffering should be in the ifq above the NIC.  For UDP, I think
udp_output() puts buffers on the ifq and calls the driver for every one,
but the driver shouldn't do anything for most calls.  The driver can't
possibly do anything if its ring buffer is full, and shouldn't do
anything if it is nearly full.  Buffers accumulate in the ifq until the
driver gets around to them or the queue fills up.  Most ENOBUFS errors
occur when it fills up.  It can very easily fill up, especially since it
is too small in most configurations.  Just loop calling sendto().  This
will fill the ifq almost instantly unless the hardware is faster than
the software.

>> So a _well behaved_ driver will return ENOBUFS _and_ not queue the
>> frame.  However, it's entirely plausible that the driver isn't well
>> behaved - the intel drivers screwed up here and there with transmit
>> queue and failure to queue vs failure to transmit.

No, the driver doesn't have much control over the ifq.

>> So yeah, try tweaking the tx ring descriptor for the driver you're
>> using and see how big a bufring it's allocating.
>
> Yes, so I am dealing with Broadcom BCM5706/BCM5708 Gigabit Ethernet,
> i.e. bce(4).
>
> I bumped up tx_pages from 2 (default) to 8 where each page is 255
> buffer descriptors.
>
> I am seeing quite nice improvement on stable/10 where I can send
> *more* stuff :-)

255 is not many.
I am most familiar with bge, where there is a single tx ring with 511 or
512 buffer descriptors (some bge's have more, but this is unportable and
was not supported last time I looked.  The extras might be only for
input).  One of my bge's can do 640 kpps with tiny packets (only 80 kpps
with normal packets) and the other only 200 (?) kpps (both should be
limited mainly by the PCI bus, but the slow one is limited by it being a
dumbed-down 5705"plus").  At 640 kpps, it takes 800 microseconds to
transmit 512 packets.  (There is 1 packet per buffer descriptor for
small packets.)

Considerable buffering in the ifq is needed to prevent the transmitter
running dry whenever the application stops generating packets for more
than 800 microseconds for some reason, but the default buffering is
stupidly small.  The default is given by net.inet.ifqmaxlen and some
corresponding macros, and is still just 50.  50 was enough for 1 Mbps
ethernet and perhaps even for 10 Mbps, but is now too small.  Most
drivers don't use it, but use their own too-small value.  bge uses just
its own ring buffer size of 511.  I use 10000 or 40000 depending on hz:

% diff -u2 if_bge.c~ if_bge.c
% --- if_bge.c~	2012-03-13 02:13:48.144002000 +0000
% +++ if_bge.c	2012-03-13 02:13:50.123023000 +0000
% @@ -3315,5 +3316,6 @@
%  	ifp->if_start = bge_start;
%  	ifp->if_init = bge_init;
% -	ifp->if_snd.ifq_drv_maxlen = BGE_TX_RING_CNT - 1;
% +	ifp->if_snd.ifq_drv_maxlen = BGE_TX_RING_CNT +
% +	    imax(4 * tick, 10000) / 1;
%  	IFQ_SET_MAXLEN(&ifp->if_snd, ifp->if_snd.ifq_drv_maxlen);
%  	IFQ_SET_READY(&ifp->if_snd);

40000 is what is needed for 4 ticks' worth of buffering at hz = 100.
40000 is far too large where 50 is far too small, but something like it
is needed when hz is large due to another problem: select() on the
ENOBUFS condition is broken (unsupported), so when sendto() returns
ENOBUFS there is no way for the application to tell how long it should
wait before retrying.  If it wants to burn CPU then it can spin calling
sendto().
Otherwise, it should sleep, but with a sleep granularity of 1 tick this
requires several ticks' worth of buffering to avoid the transmitter
running dry.  Large queue lengths give a large latency for packets at
the end of the queue and give no chance of the working set fitting in an
Ln cache for small n.

The precise stupidly small value of (tx_ring_count - 1) for the ifq
length seems to be for no good reason.  Subtracting 1 is apparently to
increase the chance that all packets in the ifq can be fitted into the
tx ring.  But this is silly since the ifq packet count is in different
units to the buffer descriptor count.  For normal-size packets, there
are 2 descriptors per packet.  So in the usual case where the ifq is
full, only about half of it can be moved to the tx ring.  And this is
good since it gives a little more buffering.  Otherwise, the effective
buffering is just what is in the tx ring, since none is left in the ifq
after transferring everything.

(tx_ring_count - 1) is used by many other drivers.  E.g., fxp.  fxp is
only 100 Mbps and its tx_ring_count is 128.  128 is a bit larger than 50
but not enough.  Scaling down my 40000 gives 4000 for hz = 100 and 400
for hz = 1000.  I never worried about this problem at 100 Mbps.

Changing from 2 rings of length 255 to 8 of length 255 shouldn't make
much difference if other things are configured correctly.  It doesn't
matter much if the buffering is in the ifq or in ring buffers.  Multiple
ring buffers, filled in advance of the active one running dry so that
the next one can be switched to quickly, mainly give you a tiny latency
optimization.  I get similar optimizations more in software for bge, by
doing watermark stuff.  The boundary between the ifq and the tx ring
also acts as a primitive watermark.  With watermarks, it is best not to
divide up the buffer evenly, but that is what the (tx_ring_count - 1)
sizing for the ifq sort of does.

Bruce