From owner-freebsd-net@FreeBSD.ORG  Mon Sep 27 12:55:46 2010
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 2F8661065679
	for <net@freebsd.org>; Mon, 27 Sep 2010 12:55:46 +0000 (UTC)
	(envelope-from andre@freebsd.org)
Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2])
	by mx1.freebsd.org (Postfix) with ESMTP id 3F9238FC0A
	for <net@freebsd.org>; Mon, 27 Sep 2010 12:55:43 +0000 (UTC)
Received: (qmail 81613 invoked from network); 27 Sep 2010 12:48:17 -0000
Received: from localhost (HELO [127.0.0.1]) ([127.0.0.1])
	(envelope-sender <andre@freebsd.org>)
	by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP
	for <rizzo@iet.unipi.it>; 27 Sep 2010 12:48:17 -0000
Message-ID: <4CA09451.7010401@freebsd.org>
Date: Mon, 27 Sep 2010 14:55:45 +0200
From: Andre Oppermann <andre@freebsd.org>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US;
	rv:1.9.2.9) Gecko/20100825 Thunderbird/3.1.3
MIME-Version: 1.0
To: Luigi Rizzo <rizzo@iet.unipi.it>
References: <4C9DA26D.7000309@freebsd.org> <4C9DB0C3.5010601@freebsd.org>
	<20100925163010.GA76213@onelab2.iet.unipi.it>
In-Reply-To: <20100925163010.GA76213@onelab2.iet.unipi.it>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: FreeBSD Net <net@freebsd.org>
Subject: Re: mbuf changes
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
	<mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
	<mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 27 Sep 2010 12:55:46 -0000

On 25.09.2010 18:30, Luigi Rizzo wrote:
> On Sat, Sep 25, 2010 at 10:20:19AM +0200, Andre Oppermann wrote:
>> On 25.09.2010 09:19, Julian Elischer wrote:
>>> over the last few years there has been a bit of talk about some changes
>>> people want to see in mbufs
>>> for 9.x
>>> extra fields, changes in the way things are done, etc.
>>>
>>> If you are one of these people, pipe up now..
>>>
>>> to get the ball rolling..
>>>
>>> * Add a field for the current FIB.. currently this is 4 bits stolen from
>>> the flags.
>>> what would be a good width: 8,12,16,24,32 bits?
>>> this would allow setfib to use numbers greater than 16 (the current max)
>>
>> 16 bits for 65535 FIB's should be sufficient.  More than that seems really
>> excessive.
>>
>>> * Preallocating some room for some number of tags before we start
>>> allocating
>>> (expensively) new ones.
>>
>> Within the mbuf?  Or at external and attached mbuf allocation time?  Tags
>> are variable width and such not really suitable for pre-allocation.
>
> my idea was to have an extra field in the mbuf to tell how much room
> should be reserved/used for metadata (such as mtags) after
> the payload area so you don't need to change the allocator, and
> possibly can even modify this on an existing mbuf.
> Almost always mbufs have spare room (e.g. incoming pkts have all
> data in the cluster and mostly empty mdata; outgoing, except
> for rare cases, tend to be in a similar situation.
> So this approach would allow to take an already allocated
> mbuf and put the mtag in the spare area after the data.

For incoming data this approach could work, as 2K mbuf clusters are
usually used and they have trailing space available, or rather the normal
mbuf referencing the cluster has its own data section unused.

If the trailing space is to be used, M_TRAILINGSPACE() needs modification
and a full tree audit is required to make sure that all mbuf consumers use
it correctly and not some version of their own that directly assumes
certain mbuf sizes, etc.  A lot of work.
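
To illustrate the kind of change meant here, a minimal userland sketch of a
trailing-space calculation that honours a reserved metadata area.  This is
not the kernel macro; struct sk_buf, the meta_reserved field and
sk_trailingspace() are hypothetical names made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for an mbuf's storage; not the kernel's struct mbuf. */
struct sk_buf {
	size_t size;           /* total size of the storage area */
	size_t data_off;       /* offset of the first payload byte */
	size_t len;            /* payload length */
	size_t meta_reserved;  /* hypothetical: bytes at the end kept for metadata */
};

/*
 * Space a consumer may still append to, excluding the reserved tail.
 * Any code that computes this by hand from assumed buffer sizes would
 * overwrite the metadata, hence the need for a tree audit.
 */
size_t
sk_trailingspace(const struct sk_buf *b)
{
	size_t used_end, slack;

	assert(b != NULL);
	used_end = b->data_off + b->len;
	assert(used_end <= b->size);

	slack = b->size - used_end;
	if (b->meta_reserved >= slack)
		return 0;
	return slack - b->meta_reserved;
}
```

With a 2048 byte cluster, a 1500 byte payload at offset 0 and 64 bytes
reserved, the sketch reports 484 bytes of appendable space instead of 548.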

For locally generated mbufs and socket buffers we try to use the mbufs to
their maximal extent.  When the socket buffer data is packetized it is
normally referenced, so we get a normal mbuf with its data portion unused.
So that could work.

A complication is the m_tag_free() field and function which puts the memory
deallocation into the hands of the mtag user.  That means all mtag consumers
have to be made aware of mbuf-provided storage, which must not be returned
directly to the memory allocator (malloc/UMA).

So the only way I realistically see is to make use of the mbuf's unused
data portion when it has external storage attached.  This should probably
cover about 98% of all cases.  The rest have to malloc() the mtag storage
as usual.
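
A rough userland sketch of that idea, to make the free-function issue
concrete.  It is a model under stated assumptions, not kernel code: struct
sk_mbuf, struct sk_tag, SK_INLINE and all sk_* functions are hypothetical,
and the 192 byte internal area is an arbitrary figure for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SK_INLINE 192  /* assumed size of the mbuf's internal data area */

struct sk_tag {
	struct sk_tag *next;
	void (*tag_free)(struct sk_tag *);  /* plays the role of m_tag_free */
	size_t len;                         /* payload bytes after the header */
};

struct sk_mbuf {
	int has_ext;         /* nonzero: payload lives in external storage */
	size_t inline_used;  /* bytes of the internal area handed to tags */
	struct sk_tag *tags;
	_Alignas(max_align_t) unsigned char inline_area[SK_INLINE];
};

/* Storage belongs to the mbuf: nothing to hand back to the allocator. */
static void sk_tag_free_inline(struct sk_tag *t) { (void)t; }

/* Storage came from malloc(): return it. */
static void sk_tag_free_malloc(struct sk_tag *t) { free(t); }

struct sk_tag *
sk_tag_alloc(struct sk_mbuf *m, size_t payload_len)
{
	const size_t align = _Alignof(max_align_t);
	size_t need = sizeof(struct sk_tag) + payload_len;
	struct sk_tag *t;

	assert(m != NULL);
	need = (need + align - 1) / align * align;  /* keep later tags aligned */

	if (m->has_ext && need <= SK_INLINE - m->inline_used) {
		/* Common case: carve the tag out of the unused internal area. */
		t = (struct sk_tag *)(void *)(m->inline_area + m->inline_used);
		m->inline_used += need;
		t->tag_free = sk_tag_free_inline;
	} else {
		/* Fallback: allocate as usual. */
		t = malloc(need);
		if (t == NULL)
			return NULL;
		t->tag_free = sk_tag_free_malloc;
	}
	t->len = payload_len;
	t->next = m->tags;
	m->tags = t;
	return t;
}

/* Release every tag through its own free function. */
void
sk_tag_delete_chain(struct sk_mbuf *m)
{
	struct sk_tag *t, *next;

	for (t = m->tags; t != NULL; t = next) {
		next = t->next;
		t->tag_free(t);
	}
	m->tags = NULL;
	m->inline_used = 0;
}
```

The point of the per-tag free pointer in the sketch is that a consumer which
calls free() on a tag directly, instead of going through tag_free, would
corrupt the mbuf in the inline case.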

I could whip up a prototype for review in the next few weeks.

>>> * dynamically working out what the front padding size should be.. per
>>> session.. i.e.
> ...
>> We already have "max_linkhdr" that specifies how much space is left
>
> the issue is that this is global (kern.ipc.max_linkhdr)
> but perhaps it would be good if we could make it per-socket
> so either we set it with a setsockopt, or the system can
> adjust back the value for specific sockets once it detects
> that it needs extra room
> (if you make the default too large, the useful room in the mdata
> area becomes too small unless perhaps we move to 512-bytes mbufs

For most protocols it is only necessary to leave enough space in
the mbuf to fit their header.  In the case of TCP this is currently
100 bytes.  That leaves 156 bytes for the mbuf header and prepend
space.  TCP can easily be modified to allocate an mbuf cluster when
the prepend space is too large.  IIRC UDP already gets this right.
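
The arithmetic and the cluster fallback can be sketched as follows.  This
is only an illustration: the sk_* names are hypothetical, and the mbuf
header length passed in by the caller is an assumption, not a figure from
the tree.

```c
#include <assert.h>
#include <stddef.h>

enum sk_store { SK_STORE_MBUF, SK_STORE_CLUSTER };

/*
 * Bytes left for the mbuf header plus prepend space once the protocol's
 * own header has been reserved, e.g. 256 - 100 = 156 for TCP.
 */
size_t
sk_header_budget(size_t msize, size_t proto_hdr)
{
	assert(proto_hdr <= msize);
	return msize - proto_hdr;
}

/*
 * Use a plain mbuf when the mbuf header and the link-layer prepend space
 * fit into the budget; otherwise fall back to an mbuf cluster.
 */
enum sk_store
sk_pick_storage(size_t msize, size_t mhdr_len, size_t linkhdr,
    size_t proto_hdr)
{
	size_t budget = sk_header_budget(msize, proto_hdr);

	if (mhdr_len + linkhdr <= budget)
		return SK_STORE_MBUF;
	return SK_STORE_CLUSTER;
}
```

For example, with an assumed 56 byte mbuf header, a prepend space of 16
bytes stays within the 156 byte budget, while 120 bytes would push the
protocol onto a cluster.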

The setsockopt route is suboptimal because it requires the application
to know about all the encapsulation steps that may happen to its packets.

A callback to update the prepend value is also very difficult because
it has to be implemented for every protocol.

Taking into consideration that prepending and encapsulation are limited
to some reasonable amount (you don't want half of your MTU-sized packet
to be additional encapsulation headers) I'd say the global max_linkhdr
is sufficient.

-- 
Andre