Date: Fri, 08 Mar 2013 08:54:14 +0100
From: Andre Oppermann <andre@freebsd.org>
To: Garrett Wollman
Cc: jfv@freebsd.org, freebsd-net@freebsd.org
Subject: Re: Limits on jumbo mbuf cluster allocation
Message-ID: <51399926.6020201@freebsd.org>
In-Reply-To: <20793.36593.774795.720959@hergotha.csail.mit.edu>
List-Id: Networking and TCP/IP with FreeBSD

On 08.03.2013 08:10, Garrett Wollman wrote:
> I have a machine (actually six of them) with an Intel dual-10G NIC on
> the motherboard. Two of them (so far) are connected to a network
> using jumbo frames, with an MTU a little under 9k, so the ixgbe driver
> allocates 32,000 9k clusters for its receive rings. I have noticed,
> on the machine that is an active NFS server, that it can get into a
> state where allocating more 9k clusters fails (as reflected in the
> mbuf failure counters) at a utilization far lower than the configured
> limits -- in fact, quite close to the number allocated by the driver
> for its rx ring. Eventually, network traffic grinds completely to a
> halt, and if one of the interfaces is administratively downed, it
> cannot be brought back up again. There's generally plenty of physical
> memory free (at least two or three GB).

You have an amd64 kernel running HEAD or 9.x?

> There are no console messages generated to indicate what is going on,
> and overall UMA usage doesn't look extreme. I'm guessing that this is
> a result of kernel memory fragmentation, although I'm a little bit
> unclear as to how this actually comes about. I am assuming that this
> hardware has only limited scatter-gather capability and can't receive
> a single packet into multiple buffers of a smaller size, which would
> reduce the requirement for two-and-a-quarter consecutive pages of KVA
> for each packet. In actual usage, most of our clients aren't on a
> jumbo network, so most of the time, all the packets will fit into a
> normal 2k cluster, and we've never observed this issue when the
> *server* is on a non-jumbo network.
>
> Does anyone have suggestions for dealing with this issue? Will
> increasing the amount of KVA (to, say, twice physical memory) help
> things? It seems to me like a bug that these large packets don't have
> their own submap to ensure that allocation is always possible when
> sufficient physical pages are available.

Jumbo pages come directly from the kernel_map, which on amd64 is 512GB,
so KVA shouldn't be a problem. Your problem indeed appears to come from
physical memory fragmentation in the pmap. There is a buddy memory
allocator at work, but I fear it runs into serious trouble when it has
to allocate a large number of objects spanning more than 2 contiguous
pages. Also, since you're doing NFS serving, almost all memory will be
in use for file caching.

Running a NIC with jumbo frames enabled gives some interesting
trade-offs. Unfortunately most NICs can't have multiple DMA buffer
sizes on the same receive queue and pick the best size for the incoming
frame. That means they need to use the largest jumbo mbuf for all
receive traffic, even for a tiny 40-byte ACK. The send side is not
constrained in this way and tries to use PAGE_SIZE clusters for socket
buffers whenever it can.

Many, but not all, NICs are able to split a received jumbo frame into
multiple smaller DMA segments forming an mbuf chain. The ixgbe hardware
is capable of doing this, and the driver supports it but doesn't
actively make use of it.

Another issue with many drivers is their inability to deal with mbuf
allocation failure for their receive DMA ring. They try to fill it up
to the maximum ring size and balk on failure. Rings have become very
big and usually are a power of two. The driver could function with a
partially filled RX ring too, maybe with some performance impact when
it gets really low. On every rxeof it tries to refill the ring, so when
resources become available again it'd balance out. NICs with multiple
receive queues/rings make this problem even more acute.

A theoretical fix would be to dedicate an entire superpage of 1GB or so
exclusively to the jumbo frame UMA zone as backing memory. That memory
would be gone for all other uses though, even if not actually used.
Allocating the superpage and determining its size would have to be done
manually by setting loader variables. I don't see a reasonable way to
do this with autotuning because it requires advance knowledge of the
usage patterns.

IMHO the right fix is to strongly discourage the use of jumbo clusters
larger than PAGE_SIZE whenever the hardware is capable of splitting the
frame into multiple clusters. The allocation constraint then is only
available memory, no longer contiguous pages, and the waste factor for
small frames is much lower. The performance impact is minimal to
non-existent. In addition, drivers shouldn't break down when the RX
ring can't be filled to the max.

I recently got yelled at for suggesting to remove jumbo clusters larger
than PAGE_SIZE. However, your case proves that such jumbo frames are
indeed their own can of worms and should really be used only for NICs
that have to do jumbo frames *and* are incapable of RX scatter DMA.

-- 
Andre
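
To illustrate the RX refill point, here is a minimal sketch of a refill
loop that tolerates allocation failure and simply catches up on a later
rxeof. struct rx_ring and rxr_alloc_buf() are hypothetical stand-ins,
not taken from any in-tree driver:

/*
 * Hedged sketch, not from any real driver: refill the RX ring as far
 * as the allocator allows instead of insisting it be filled to the
 * maximum.  The ring keeps working partially filled; the next rxeof
 * pass calls here again and catches up once clusters are available.
 */
struct rx_ring {
	int	size;		/* total descriptors, power of two */
	int	populated;	/* descriptors currently backed by an mbuf */
	int	next_refill;	/* next empty descriptor index */
};

/* Placeholder for the real "allocate cluster + map DMA" step;
 * returns 0 on success, nonzero when the zone is exhausted. */
static int
rxr_alloc_buf(struct rx_ring *rxr, int idx)
{
	(void)rxr; (void)idx;
	return (0);
}

static void
rxr_refill(struct rx_ring *rxr)
{
	while (rxr->populated < rxr->size) {
		if (rxr_alloc_buf(rxr, rxr->next_refill) != 0)
			break;		/* don't balk, just stop early */
		rxr->next_refill = (rxr->next_refill + 1) & (rxr->size - 1);
		rxr->populated++;
	}
}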
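
And purely as an illustration of the PAGE_SIZE-cluster point, not code
from the ixgbe driver: with RX scatter the buffer behind one descriptor
slot can be a chain of page-sized clusters built with the stock mbuf
KPI (m_getjcl with MJUMPAGESIZE). rxr_get_jumbo_chain() is a made-up
name, and the DMA mapping and descriptor setup a real driver needs are
left out:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/malloc.h>
#include <sys/mbuf.h>

static struct mbuf *
rxr_get_jumbo_chain(int frame_len)
{
	struct mbuf *top, *m, *prev;
	int remain;

	/* First segment carries the packet header. */
	top = m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, MJUMPAGESIZE);
	if (top == NULL)
		return (NULL);
	remain = frame_len - MJUMPAGESIZE;
	prev = top;

	/* Add page-sized clusters until the frame length is covered. */
	while (remain > 0) {
		m = m_getjcl(M_NOWAIT, MT_DATA, 0, MJUMPAGESIZE);
		if (m == NULL) {
			m_freem(top);	/* free the partial chain */
			return (NULL);
		}
		prev->m_next = m;
		prev = m;
		remain -= MJUMPAGESIZE;
	}
	return (top);
}

The allocation constraint for each piece is then a single page, so it
can be satisfied even when physical memory is fragmented.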