Date: Fri, 8 Mar 2013 02:10:41 -0500
From: Garrett Wollman
To: freebsd-net@freebsd.org
Cc: jfv@freebsd.org
Subject: Limits on jumbo mbuf cluster allocation
Message-ID: <20793.36593.774795.720959@hergotha.csail.mit.edu>

I have a machine (actually six of them) with an Intel dual-10G NIC on the
motherboard.  Two of them (so far) are connected to a network using jumbo
frames, with an MTU a little under 9k, so the ixgbe driver allocates
32,000 9k clusters for its receive rings.

I have noticed, on the machine that is an active NFS server, that it can
get into a state where allocating more 9k clusters fails (as reflected in
the mbuf failure counters) at a utilization far lower than the configured
limits -- in fact, quite close to the number allocated by the driver for
its rx ring.  Eventually, network traffic grinds completely to a halt,
and if one of the interfaces is administratively downed, it cannot be
brought back up again.  There's generally plenty of physical memory free
(at least two or three GB).  There are no console messages generated to
indicate what is going on, and overall UMA usage doesn't look extreme.

I'm guessing that this is a result of kernel memory fragmentation,
although I'm a little bit unclear as to how this actually comes about.
I am assuming that this hardware has only limited scatter-gather
capability and can't receive a single packet into multiple buffers of a
smaller size, which would reduce the requirement for two-and-a-quarter
consecutive pages of KVA for each packet.  In actual usage, most of our
clients aren't on a jumbo network, so most of the time, all the packets
will fit into a normal 2k cluster, and we've never observed this issue
when the *server* is on a non-jumbo network.

Does anyone have suggestions for dealing with this issue?  Will
increasing the amount of KVA (to, say, twice physical memory) help
things?
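For reference, here is a minimal userland sketch (assuming a FreeBSD
system that exposes the kern.ipc.nmbclusters and kern.ipc.nmbjumbo9
sysctls, as 9.x does) that reads the configured cluster limits netstat -m
reports and works out the contiguous-page footprint of each 9k cluster.
It is purely illustrative, not a diagnostic for the failure described
above.

	/*
	 * Illustrative sketch only: print the 2k and 9k cluster limits
	 * and the number of consecutive pages each 9k cluster needs.
	 */
	#include <sys/types.h>
	#include <sys/sysctl.h>

	#include <stdio.h>
	#include <unistd.h>

	static int
	get_int_sysctl(const char *name)
	{
		int val = -1;
		size_t len = sizeof(val);

		if (sysctlbyname(name, &val, &len, NULL, 0) == -1)
			perror(name);
		return (val);
	}

	int
	main(void)
	{
		int nmbclusters = get_int_sysctl("kern.ipc.nmbclusters");
		int nmbjumbo9 = get_int_sysctl("kern.ipc.nmbjumbo9");
		long pagesize = sysconf(_SC_PAGESIZE);
		long jumbo9 = 9 * 1024;		/* MJUM9BYTES */

		printf("2k cluster limit (kern.ipc.nmbclusters): %d\n",
		    nmbclusters);
		printf("9k cluster limit (kern.ipc.nmbjumbo9):   %d\n",
		    nmbjumbo9);

		/*
		 * A 9k cluster covers 9216/4096 = 2.25 pages, so each
		 * one is carved out of three consecutive pages; 32,000
		 * of them for the rx rings comes to roughly 375 MB of
		 * such contiguous chunks.
		 */
		printf("pages per 9k cluster: %.2f (allocated as %ld)\n",
		    (double)jumbo9 / pagesize,
		    (jumbo9 + pagesize - 1) / pagesize);
		return (0);
	}

The arithmetic is only there to make the point that every 9k cluster
costs a run of consecutive pages, which is exactly where fragmentation
bites.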
It seems to me like a bug that these large packets don't have their own
submap to ensure that allocation is always possible when sufficient
physical pages are available.

-GAWollman
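A rough sketch of the submap idea above, modeled on the way submaps such
as buffer_map are carved out of kernel_map: the kmem_suballoc() call is
the existing KPI, but the names mbuf_jumbo_map, jumbo_map_size, and
mbuf_jumbo_map_init are hypothetical, and none of this has been compiled
or tested.

	/* Hypothetical only: a dedicated KVA submap for jumbo clusters. */
	#include <sys/param.h>
	#include <sys/systm.h>

	#include <vm/vm.h>
	#include <vm/vm_extern.h>
	#include <vm/vm_kern.h>

	static vm_map_t mbuf_jumbo_map;		/* hypothetical name */
	static vm_offset_t jumbo_minaddr, jumbo_maxaddr;

	static void
	mbuf_jumbo_map_init(vm_size_t jumbo_map_size)
	{
		/*
		 * Reserve a dedicated range of KVA so that 9k/16k
		 * cluster allocations never have to compete with the
		 * rest of the kernel map for contiguous space.
		 */
		mbuf_jumbo_map = kmem_suballoc(kernel_map, &jumbo_minaddr,
		    &jumbo_maxaddr, jumbo_map_size, FALSE);
	}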