From owner-freebsd-net@FreeBSD.ORG  Thu Jan 30 01:34:42 2014
Date: Wed, 29 Jan 2014 17:34:34 -0800
From: John-Mark Gurney
To: Adrian Chadd, Garrett Wollman, FreeBSD Net
Subject: Re: Big physically contiguous mbuf clusters
Message-ID: <20140130013434.GP93141@funkthat.com>
References: <21225.20047.947384.390241@khavrinen.csail.mit.edu> <20140129231121.GA18434@ox>
In-Reply-To: <20140129231121.GA18434@ox>
User-Agent: Mutt/1.4.2.3i
List-Id: Networking and TCP/IP with FreeBSD

Navdeep Parhar wrote this message on Wed, Jan 29, 2014 at 15:11 -0800:
> On Wed, Jan 29, 2014 at 02:21:21PM -0800, Adrian Chadd wrote:
> > Hi,
> > 
> > On 29 January 2014 10:54, Garrett Wollman wrote:
> > > Resolved: that mbuf clusters longer than one page ought not be
> > > supported.  There is too much physical-memory fragmentation for them
> > > to be of use on a moderately active server.  9k mbufs are especially
> > > bad, since in the fragmented case they waste 3k per allocation.
> > 
> > I've been wondering whether it'd be feasible to teach the physical
> > memory allocator about >page sized allocations and to create zones of
> > slightly more physically contiguous memory.
> 
> I think this would be very useful.  For example, a zone_jumbo32 would
> hit a sweet spot -- enough to fit 3 jumbo frames and some loose change
> for metadata.  I'd like to see us improve our allocators and VM system

Actually, that is what currently happens...  I just verified this on
-current...

http://fxr.watson.org/fxr/source/vm/uma_core.c#L880 is where the
allocation happens, notice the uk_ppera, and kgdb says:

print zone_jumbo9[0].uz_kegs.lh_first[0].kl_keg[0].uk_ppera
$7 = 3
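For reference, here is the same pages-per-allocation arithmetic as a
small standalone sketch (illustrative only, not from the original mail;
the constants simply mirror the stock sys/param.h and sys/mbuf.h values
rather than pulling in the kernel headers):

#include <stdio.h>

/* Mirrors the stock kernel definitions, for illustration only. */
#define PAGE_SIZE	4096
#define MCLBYTES	2048		/* standard 2k cluster */
#define MJUMPAGESIZE	PAGE_SIZE	/* page-sized jumbo cluster */
#define MJUM9BYTES	(9 * 1024)	/* 9k jumbo cluster */
#define MJUM16BYTES	(16 * 1024)	/* 16k jumbo cluster */

/* Round up to whole pages, as the keg must. */
#define howmany(x, y)	(((x) + ((y) - 1)) / (y))

int
main(void)
{
	/* Contiguous pages needed per allocation (cf. uk_ppera above). */
	printf("cluster: %d page(s)\n", (int)howmany(MCLBYTES, PAGE_SIZE));
	printf("jumbop:  %d page(s)\n", (int)howmany(MJUMPAGESIZE, PAGE_SIZE));
	printf("jumbo9:  %d page(s)\n", (int)howmany(MJUM9BYTES, PAGE_SIZE));
	printf("jumbo16: %d page(s)\n", (int)howmany(MJUM16BYTES, PAGE_SIZE));
	return (0);
}

jumbo9 works out to 3 contiguous pages, i.e. 12k of physically
contiguous memory for a 9k payload (the ~3k of waste Garrett mentions
above), and jumbo16 to 4.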
> to work better with larger contiguous allocations, rather than
> deprecating the larger zones.  It seems backwards to push towards
> smaller allocation units when installed physical memory in a typical
> system continues to rise.
> 
> Allocating 3 x 4K instead of 1 x 9K for a jumbo means 3x the number of
> vtophys translations, 3x the phys_addr/len traffic on the PCIe bus

I don't think that this will be an issue...  If we support a 9k jumbo
that is not physically contiguous (easy on main memory), the table we
use to fetch the first physical page will likely have the next two
pages in it as well, so I doubt there will be a significant performance
penalty; yes, we'll loop a few more times, but main memory access is
more often the speed limiter in these situations...

> (scatter list has to be fed to the chip and now it's 3x what it has to
> be), 3x the number of "wrapper" mbuf allocations (one for each 4K
> cluster) which will then be stitched together to form a frame, etc. etc.

And what is that as a percentage of overall traffic?  About .4%
(assuming a 16 byte phys_addr/len descriptor per 4k page, that's
16/4096, or roughly 0.4%, of extra bus traffic)...  If your PCIe bus is
saturating and you need that extra .4% back, then you have a serious
issue w/ your bus layout...

-- 
John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."