From owner-freebsd-net@FreeBSD.ORG  Thu Jan 30 01:34:42 2014
Date: Wed, 29 Jan 2014 17:34:34 -0800
From: John-Mark Gurney
To: Adrian Chadd, Garrett Wollman, FreeBSD Net
Subject: Re: Big physically contiguous mbuf clusters
Message-ID: <20140130013434.GP93141@funkthat.com>
References: <21225.20047.947384.390241@khavrinen.csail.mit.edu> <20140129231121.GA18434@ox>
In-Reply-To: <20140129231121.GA18434@ox>
User-Agent: Mutt/1.4.2.3i
List-Id: Networking and TCP/IP with FreeBSD

Navdeep Parhar wrote this message on Wed, Jan 29, 2014 at 15:11 -0800:
> On Wed, Jan 29, 2014 at 02:21:21PM -0800, Adrian Chadd wrote:
> > Hi,
> > 
> > On 29 January 2014 10:54, Garrett Wollman wrote:
> > > Resolved: that mbuf clusters longer than one page ought not be
> > > supported.  There is too much physical-memory fragmentation for them
> > > to be of use on a moderately active server.  9k mbufs are especially
> > > bad, since in the fragmented case they waste 3k per allocation.
> > 
> > I've been wondering whether it'd be feasible to teach the physical
> > memory allocator about >page sized allocations and to create zones of
> > slightly more physically contiguous memory.
> 
> I think this would be very useful.  For example, a zone_jumbo32 would
> hit a sweet spot -- enough to fit 3 jumbo frames and some loose change
> for metadata.  I'd like to see us improve our allocators and VM system

Actually, that is what currently happens...  I just verified this on
-current...

http://fxr.watson.org/fxr/source/vm/uma_core.c#L880 is where the
allocation happens, notice the uk_ppera, and kgdb says:

print zone_jumbo9[0].uz_kegs.lh_first[0].kl_keg[0].uk_ppera
$7 = 3
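For reference, here is the same pages-per-allocation arithmetic as a
small standalone sketch (illustrative only, not from the original mail;
the constants simply mirror the stock sys/param.h and sys/mbuf.h values
rather than pulling in the kernel headers):

#include <stdio.h>

/* Mirrors the stock kernel definitions, for illustration only. */
#define PAGE_SIZE	4096
#define MCLBYTES	2048		/* standard 2k cluster */
#define MJUMPAGESIZE	PAGE_SIZE	/* page-sized jumbo cluster */
#define MJUM9BYTES	(9 * 1024)	/* 9k jumbo cluster */
#define MJUM16BYTES	(16 * 1024)	/* 16k jumbo cluster */

/* Round up to whole pages, as the keg must. */
#define howmany(x, y)	(((x) + ((y) - 1)) / (y))

int
main(void)
{
	/* Contiguous pages needed per allocation (cf. uk_ppera above). */
	printf("cluster: %d page(s)\n", (int)howmany(MCLBYTES, PAGE_SIZE));
	printf("jumbop:  %d page(s)\n", (int)howmany(MJUMPAGESIZE, PAGE_SIZE));
	printf("jumbo9:  %d page(s)\n", (int)howmany(MJUM9BYTES, PAGE_SIZE));
	printf("jumbo16: %d page(s)\n", (int)howmany(MJUM16BYTES, PAGE_SIZE));
	return (0);
}

jumbo9 works out to 3 contiguous pages, i.e. 12k of physically
contiguous memory for a 9k payload (the ~3k of waste Garrett mentions
above), and jumbo16 to 4.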
> to work better with larger contiguous allocations, rather than
> deprecating the larger zones.  It seems backwards to push towards
> smaller allocation units when installed physical memory in a typical
> system continues to rise.
> 
> Allocating 3 x 4K instead of 1 x 9K for a jumbo means 3x the number of
> vtophys translations, 3x the phys_addr/len traffic on the PCIe bus

I don't think that this will be an issue...  If we support a 9k jumbo
that is not physically contiguous (easy on main memory), the table we
use to fetch the first physical page will likely have the next two
pages in it as well, so I doubt there will be a significant performance
penalty; yes, we'll loop a few more times, but main memory access is
more often the speed limiter in these situations...

> (scatter list has to be fed to the chip and now it's 3x what it has to
> be), 3x the number of "wrapper" mbuf allocations (one for each 4K
> cluster) which will then be stitched together to form a frame, etc. etc.

And what is that as a percentage of overall traffic?  About .4%
(assuming a 16 byte phys_addr/len descriptor per 4k page, that's
16/4096, or roughly 0.4%, of extra bus traffic)...  If your PCIe bus is
saturating and you need that extra .4% back, then you have a serious
issue w/ your bus layout...

-- 
John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."